This tutorial goes over some popular techniques for getting the best performance out of your Python code, including Numpy, threading, multiprocessing, and just-in-time compilation. The emphasis is on CPU performance, using the running example of speeding up a numerical double integral.
- Requirements
- Example for this Tutorial: Coulomb Interaction on Excited States
- A First Attempt: Pure Python
- Numpy
- Multithreading
- Multiprocessing
- Numba
- Combination of Numba, Threading, and Multiprocessing
- Summary of Results
First, let's make sure we have the needed packages installed. To run the tutorial, you will need to have Python and Jupyter installed. Instructions for how to do so on all platforms, including Windows, MacOS, and Linux, can be found here.
In addition, the tutorial will leverage the following packages:
- conda install python=3.10
- pip install numpy
- pip install numba
- pip install multiprocess
Or, you can pip install them directly from within the notebook:
! pip install numba
! pip install multiprocess
! pip install numpy
Let's make sure everything can be imported. We'll also set some constants and import the other packages needed for this tutorial.
import numpy as np
import time
import numba
import multiprocess
from math import sqrt
AU_2_EV = 27.211396 # convert from atomic units to electron volts
The example that we will use is the Coulomb coupling between two excited state monomers. This coupling is used to estimate the collective excitation energies of the combined dimer system, while only computing the excitation energies and transition densities of each monomer separately.
For those interested in the topic, I suggest taking a look at "Charge and Energy Transfer Dynamics in Molecular Systems" by Prof. Dr. Oliver Kühn.
If you're not familiar with excitation energy calculations, that's OK! All you need to be concerned with is the idea of computing a numerical double integral, which we will use as the running example in this tutorial.
Consider two molecules, each with its own excitation energy.
When combined, the supramolecular system will have new excited states separated by a splitting $2\Delta$.
This splitting depends on the Coulomb coupling between the monomers' transition densities $\rho_{eg}$.
The coupling can be approximated by dividing each density onto a grid and summing over each "cube" density element.
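Written out, the coupling $\Delta$ between the two transition densities and the grid approximation implemented below are

$$\Delta = \int \int \frac{\rho_1^{eg}(\mathbf{r}_1)\,\rho_2^{eg}(\mathbf{r}_2)}{|\mathbf{r}_1 - \mathbf{r}_2|}\,d\mathbf{r}_1\,d\mathbf{r}_2 \approx \sum_{i}\sum_{j}\frac{\rho_1^{eg}(\mathbf{r}_i)\,\rho_2^{eg}(\mathbf{r}_j)}{|\mathbf{r}_i - \mathbf{r}_j|}\,\Delta V_1\,\Delta V_2 ,$$

where $\Delta V_1 \Delta V_2$ is the product of the two cube volume elements (the dV passed to the functions below).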
The following sections of this tutorial demonstrate various implementations of this summation, starting with the slowest and ending with the most efficient methods.
First we load in two transition density cube files for a cresyl-violet dimer.
- Their geometries were oriented in a stacked configuration and optimized at the CAM-B3LYP/6-31G* level of theory with Gaussian 16.
- MultiWfn was used to generate the cube densities from a Gaussian formatted checkpoint file.
Because some of these examples will take some time to run, we will use two versions of the data: one with the "regular" number of cube points and another with a reduced number of points.
Since the nested for loops scale with the product of the number of points in the two cubes, we can estimate the full-dataset runtime by multiplying the reduced-dataset runtime by the ratio of these point-count products.
from external import Cube # custom gaussian cube file reader
from os.path import join
data_1 = Cube.CubeData(join('CV_data', 'transdens_1_low.cub'))
data_2 = Cube.CubeData(join('CV_data', 'transdens_2_low.cub'))
dV_12 = data_1.dV * data_2.dV
data_1_L = Cube.CubeData(join('CV_data', 'transdens_1_extra_low.cub'))
data_2_L = Cube.CubeData(join('CV_data', 'transdens_2_extra_low.cub'))
dV_12_L = data_1_L.dV * data_2_L.dV
print("Dimension of coordinates in regular cubes: ", data_1.coords.shape, data_2.coords.shape)
print("Dimension of density data in regular cubes: ", data_1.cube_data.shape, data_2.cube_data.shape)
print("Dimension of coordinates in reduced cubes: ", data_1_L.coords.shape, data_2_L.coords.shape)
print("Dimension of density data in reduced cubes: ", data_1_L.cube_data.shape, data_2_L.cube_data.shape)
# This ratio will be used as a means of estimating the time it takes to
# run a portion of the code with the larger data set
point_ratio = data_1.n_points * data_2.n_points/(data_1_L.n_points * data_2_L.n_points)
print(f"Ratio of the two datasets: {point_ratio:.2f}")
# will store all of our benchmarks as we proceed through the tutorial
all_timers = {}
Reading cube file CV_data/transdens_1_low.cub
Formatting
Done
Reading cube file CV_data/transdens_2_low.cub
Formatting
Done
Reading cube file CV_data/transdens_1_extra_low.cub
Formatting
Done
Reading cube file CV_data/transdens_2_extra_low.cub
Formatting
Done
Dimension of coordinates in regular cubes: (61617, 3) (66654, 3)
Dimension of density data in regular cubes: (61617,) (66654,)
Dimension of coordinates in reduced cubes: (2975, 3) (5220, 3)
Dimension of density data in reduced cubes: (2975,) (5220,)
Ratio of the two datasets: 264.47
The double summation is implemented below in what would be considered the most straightforward way possible:
- Construct two for loops, one inner and one outer, one for each cube density.
- Sum over the product of each pair of cube values divided by the distance between them. To save time, we will use the reduced cube files and estimate the full-dataset runtime with the point ratio computed above.
from math import sqrt
import time
def calc_coulomb_pure_python(pts_1, rho_1, pts_2, rho_2, dV):
'''
Parameters
----------
pts_1: (Nx3) list or array of cube coordinates
rho_1: (Nx1) list or array of cube values
pts_2: (Mx3) list or array of cube coordinates
rho_2: (Mx1) list or array of cube values
dV: product of the two cube volume elements
'''
from math import sqrt # normally this is imported beforehand, but re-importing here works around Jupyter multiprocessing quirks (discussed later)
total = 0.0
n_pts_1 = len(pts_1)
n_pts_2 = len(pts_2)
print_num = n_pts_1//5
for i in range(n_pts_1):
for j in range(n_pts_2):
# grab the coordinates and take their difference
x1, y1, z1 = pts_1[i]
x2, y2, z2 = pts_2[j]
dx = x1 - x2
dy = y1 - y2
dz = z1 - z2
# distance between the two points
r = sqrt(dx**2 + dy**2 + dz**2)
# update the summation total
total += rho_1[i]*rho_2[j]/r
# print our progress
if i % print_num == 0 or i == n_pts_1 - 1:
print(f"Coulomb integral progress: {(i / n_pts_1*100):5.1f} %")
return total*dV
# Record the current time, run the code, and record the current time again.
# The difference in the two times measures the execution time.
start = time.time()
total = calc_coulomb_pure_python(data_1_L.coords, data_1_L.cube_data, data_2_L.coords, data_2_L.cube_data, dV_12_L)
total_time = (time.time() - start)*point_ratio
all_timers['pure_python'] = total_time
print(f'pure_python*: {total_time:.2f} s ({total*AU_2_EV:.5} eV)')
Coulomb integral progress: 0.0 %
Coulomb integral progress: 20.0 %
Coulomb integral progress: 40.0 %
Coulomb integral progress: 60.0 %
Coulomb integral progress: 80.0 %
Coulomb integral progress: 100.0 %
pure_python*: 5057.54 s (0.23374 eV)
Our first speed improvement will use the Numpy library.
In short, Numpy gives us a way to manipulate multiple data points, stored in arrays, at the same time.
- Think of numpy arrays as mathematical vectors and matrices.
import numpy as np
a = np.array([1, 2, 3, 4, 5])
b = np.array([2, 3, 4, 5, 6])
c = np.array([[1, 2], [3, 4]])
d = np.array([[2, 0], [0, 3]])
# scalar addition to a vector
print(f"{ a + 2 = }")
# built in functions
print(f"{ np.sum(a) = }")
# element-wise multiplication
print(f"{ a * b = }")
# matrix-vector multiplication
print(f"{ c @ d = }")
a + 2 = array([3, 4, 5, 6, 7])
np.sum(a) = 15
a * b = array([ 2, 6, 12, 20, 30])
c @ d = array([[ 2, 6],
[ 6, 12]])
When you run a line of Python, you are actually running a bunch of compiled C code in the background.
source: https://pythonextensionpatterns.readthedocs.io/en/latest/refcount.html
Each line must go through the Python interpreter to execute the correct PyObject and its respective code.
Numpy functions are also implemented as compiled C-code, but each iteration within a function (np.sum, np.exp, etc.) is all done internally. This means the interpreter is only needed once for each numpy call!
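As a small illustration of the difference (the exact numbers will depend on your machine), compare summing ten million values with a pure-Python loop versus a single np.sum call:
import time
import numpy as np
big_array = np.random.rand(10_000_000)  # ten million random values
# pure-python loop: the interpreter is involved on every iteration
start = time.time()
loop_total = 0.0
for value in big_array:
    loop_total += value
print(f"python loop: {time.time() - start:.3f} s")
# one numpy call: the loop runs entirely inside compiled C code
start = time.time()
numpy_total = np.sum(big_array)
print(f"np.sum:      {time.time() - start:.3f} s")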
The next implementation replaces the inner loop with pure Numpy functions. Note the huge speed up!
def calc_coulomb_numpy(pts_1, rho_1, pts_2, rho_2, dV):
total = 0.0
n_pts_1 = len(pts_1)
print_num = n_pts_1//5
for i in range(n_pts_1):
if i % print_num == 0:
print(f" Coulomb Integral {(i / n_pts_1*100):.1f} %")
dr = pts_1[i] - pts_2
r = np.linalg.norm(dr, axis=1)
total += rho_1[i]*np.sum(rho_2/r)
return total*dV
start = time.time()
total = calc_coulomb_numpy(data_1_L.coords, data_1_L.cube_data, data_2_L.coords, data_2_L.cube_data, dV_12_L)
total_time = (time.time() - start)*point_ratio
# depending on the estimated time with the reduced dataset, you may be able to uncomment
# the next two lines and use the entire dataset if you are willing to wait.
# total = calc_coulomb_numpy(data_1.coords, data_1.cube_data, data_2.coords, data_2.cube_data, dV_12)
# total_time = (time.time() - start)
all_timers['numpy'] = total_time
print(f'numpy: {total_time:.2f} s ({total*AU_2_EV:.5f} eV)')
Coulomb Integral 0.0 %
Coulomb Integral 20.0 %
Coulomb Integral 40.0 %
Coulomb Integral 60.0 %
Coulomb Integral 80.0 %
numpy: 62.18 s (0.23374 eV)
Modern CPUs contain multiple CPU cores, each acting as an individual CPU but all connected to the same internal memory (CPU cache or RAM). Each core can perform (roughly) the same amount of work as any other core. Unless your code is told to do otherwise, Python will run on only one core at a time.
We tell Python how to use these cores by setting up software "threads". Each thread branches off of the main Python program, does a separate share of the work, and can be assigned to its own CPU core.
- Technically, the number of threads is not limited by the number of cores, but in HPC it's rarely beneficial to assign more threads than cores; with some software, this can actually hurt your performance! (A quick way to check your core count is sketched below.)
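To see how many cores you have to work with, a quick check (note that os.cpu_count() reports logical cores, which may include hyper-threads):
import os
print(f"Logical CPU cores available: {os.cpu_count()}")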
We can split up our integral work by partitioning the outer loop (over molecule 1) into separate contributions of the total density.
- Each sub-density will be assigned to a separate thread and will only affect the number of iterations in the outer for-loop.
- The inner for-loop will remain the same.
import threading
# Shared memory that each thread has access to
thread_totals = np.array([])
def _coulomb_by_indix(indicies, pts_1, rho_1, pts_2, rho_2, dV, thread_ID):
total = 0.0
n_pts_1 = len(indicies)
n_pts_2 = len(pts_2)
print_num = n_pts_1//5
for count, i in enumerate(indicies): # Now we loop over specified indicies only
for j in range(n_pts_2):
x1, y1, z1 = pts_1[i]
x2, y2, z2 = pts_2[j]
dx = x1 - x2
dy = y1 - y2
dz = z1 - z2
r = sqrt(dx**2 + dy**2 + dz**2)
total += rho_1[i]*rho_2[j]/r
if count % print_num == 0:
# Also print thread ID with the progress
print(f" Coulomb Integral {thread_ID} {(count / n_pts_1*100):.1f} %")
# update the global integral totals
thread_totals[thread_ID] = total*dV
def calc_coulomb_thread(n_threads, pts_1, rho_1, pts_2, rho_2, dV):
global thread_totals
all_threads = []
# initialize each thread's total to zero
thread_totals = np.zeros(n_threads)
for n in range(n_threads):
# these will be the indicies used by the inner Coulomb loop
indicies = np.arange(n, len(pts_1), n_threads)
print(f"Thread {n} using indicies ", *indicies[0:4], "...")
thread = threading.Thread(
target=_coulomb_by_indix,
args=(indicies, pts_1, rho_1, pts_2, rho_2, dV, n)
)
all_threads.append(thread)
# the function is not called until we start the thread
thread.start()
# now wait for all the threads to complete
for thread in all_threads:
thread.join()
# The total Coulomb integral is equal to the sum of each thread's partial integral
return np.sum(thread_totals)
start = time.time()
total = calc_coulomb_thread(4, data_1_L.coords, data_1_L.cube_data, data_2_L.coords, data_2_L.cube_data, dV_12_L)
# total = calc_coulomb_thread(4, data_1.coords, data_1.cube_data, data_2.coords, data_2.cube_data, dV_12)
total_time = (time.time() - start)*point_ratio
all_timers['pure_python_threaded'] = total_time
print(f'pure_python_threaded: {total_time:.2f} s ({total*AU_2_EV:.5f} eV)')
print(f'pure_python: {all_timers["pure_python"]:.2f} s ')
Thread 0 using indicies 0 4 8 12 ...
Coulomb Integral 0 0.0 %
Thread 1 using indicies 1 5 9 13 ...
Thread 2 using indicies 2 6 10 14 ...
Coulomb Integral 1 0.0 %
Coulomb Integral 2 0.0 %
Thread 3 using indicies 3 7 11 15 ...
Coulomb Integral 3 0.0 %
Coulomb Integral 1 19.9 %
Coulomb Integral 0 19.9 %
Coulomb Integral 3 19.9 %
Coulomb Integral 2 19.9 %
Coulomb Integral 3 39.8 % Coulomb Integral 1 39.8 %
Coulomb Integral 0 39.8 %
Coulomb Integral 2 39.8 %
Coulomb Integral 3 59.8 %
Coulomb Integral 1 59.7 %
Coulomb Integral 0 59.7 %
Coulomb Integral 2 59.7 %
Coulomb Integral 3 79.7 %
Coulomb Integral 1 79.6 %
Coulomb Integral 0 79.6 %
Coulomb Integral 2 79.6 %
Coulomb Integral 1 99.5 %
Coulomb Integral 3 99.6 %
Coulomb Integral 0 99.5 %
Coulomb Integral 2 99.5 %
pure_python_threaded: 5118.22 s (0.23374 eV)
pure_python: 5057.54 s
You should have noticed that the above code didn't run any faster than the original pure-python function defined above. This is because every line in the function must acquire the Global Interpreter Lock (GIL) before it can be executed. The GIL ensures that only one thread can run Python code, and thus touch Python objects, at a time. When two threads attempt to modify the same memory at the same time, we call that a race condition, which can lead to unpredictable results.
# obtain GIL
a = 5
# release GIL
# obtain GIL
print(a)
# release GIL
5
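Here is a minimal sketch of the consequence (timings are illustrative and machine-dependent): a CPU-bound, pure-Python countdown takes roughly as long split across two threads as it does on one thread, because only one thread can hold the GIL at any moment.
import threading
import time
def count_down(n):
    while n > 0:
        n -= 1
N = 10_000_000
# one thread doing all of the work
start = time.time()
count_down(N)
print(f"single thread: {time.time() - start:.2f} s")
# two threads splitting the work: not faster, because of the GIL
start = time.time()
t1 = threading.Thread(target=count_down, args=(N // 2,))
t2 = threading.Thread(target=count_down, args=(N // 2,))
t1.start()
t2.start()
t1.join()
t2.join()
print(f"two threads:   {time.time() - start:.2f} s")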
Instead, we can replace the inner for-loop with Numpy routines, which release the GIL while they execute. This allows another Python thread to run while Numpy is handling the numerical heavy lifting.
import threading
thread_totals = np.array([])
def _coulomb_by_indix_2(indicies, pts_1, rho_1, pts_2, rho_2, dV, thread_ID):
total = 0.0
n_pts_1 = len(indicies)
print_num = n_pts_1//5
for count, i in enumerate(indicies): # EDIT: loop over specified indicies only
if count % print_num == 0:
print(f" Coulomb Integral {thread_ID} {(count / n_pts_1*100):.1f} %")
dr = pts_2 - pts_1[i]
r = np.linalg.norm(dr, axis=1)
total += rho_1[i]*np.sum(rho_2/r)
thread_totals[thread_ID] = total*dV
def calc_coulomb_thread(n_threads, pts_1, rho_1, pts_2, rho_2, dV):
global thread_totals
all_threads = []
thread_totals = np.zeros(n_threads)
for n in range(n_threads):
# these will be the indicies used by the inner Coulomb loop
indicies = np.arange(n, len(pts_1), n_threads)
print(f"Thread {n} using indicies ", *indicies[0:4], "...")
thread = threading.Thread(
target=_coulomb_by_indix_2,
args=(indicies, pts_1, rho_1, pts_2, rho_2, dV, n)
)
all_threads.append(thread)
# the function is not called until we start the thread
thread.start()
# now wait for all the threads to complete
for thread in all_threads:
thread.join()
# The total Coulomb integral is equal to the sum of each thread's partial integral
return np.sum(thread_totals)
start = time.time()
total = calc_coulomb_thread(4, data_1.coords, data_1.cube_data, data_2.coords, data_2.cube_data, dV_12)
total_time = (time.time() - start)
all_timers['numpy_threaded'] = total_time
print(f'numpy_threaded: {total_time:.2f} s ({total*AU_2_EV:.5f} eV)')
print(f'numpy: {all_timers["numpy"]:.2f} s ')
print(f'speed-up: {all_timers["numpy"]/all_timers["numpy_threaded"]:.1f}x')
Thread 0 using indicies 0 4 8 12 ...
Coulomb Integral 0 0.0 %
Thread 1 using indicies 1 5 9 13 ...
Coulomb Integral 1 0.0 %
Thread 2 using indicies 2 6 10 14 ...
Coulomb Integral 2 0.0 %
Thread 3 using indicies 3 7 11 15 ...
Coulomb Integral 3 0.0 %
Coulomb Integral 0 20.0 %
Coulomb Integral 2 20.0 %
Coulomb Integral 3 20.0 %
Coulomb Integral 1 20.0 %
Coulomb Integral 0 40.0 %
Coulomb Integral 3 40.0 %
Coulomb Integral 1 40.0 %
Coulomb Integral 2 40.0 %
Coulomb Integral 3 60.0 %
Coulomb Integral 0 60.0 %
Coulomb Integral 1 60.0 %
Coulomb Integral 2 60.0 %
Coulomb Integral 3 80.0 %
Coulomb Integral 1 80.0 %
Coulomb Integral 0 80.0 %
Coulomb Integral 2 80.0 %
Coulomb Integral 3 100.0 %
Coulomb Integral 1 100.0 %
Coulomb Integral 2 100.0 %
numpy_threaded: 24.97 s (0.24229 eV)
numpy: 62.18 s
speed-up: 2.5x
Similar to threading, multiprocessing also partitions the work onto multiple computing kernels, termed processes.
The main difference here is that each process is a separate Python instance. As the processes are (mostly) independent from one another, each needs its own copy of the data being summed over.
- There are ways to share data between each process using Queues, but this will not be covered here.
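As a minimal sketch of the Pool.starmap pattern used below (the worker function here is just a toy; the real Coulomb functions are reused later), each tuple of arguments is shipped to one of the worker processes and the results come back in order:
import multiprocess as mp
def weighted_distance(x, y, weight):
    # toy worker: each call is executed inside one of the pool's processes
    return weight * (x**2 + y**2) ** 0.5
args = [(3.0, 4.0, 0.5), (6.0, 8.0, 0.1), (5.0, 12.0, 1.0)]
with mp.Pool(2) as pool:
    results = pool.starmap(weighted_distance, args)
print(results)  # [2.5, 1.0, 13.0]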
You can observe each Python instance running by monitoring your system with the top bash command. As the cell runs, you should see one python process per pool worker.
- Notice that each process uses about the same amount of RAM. Keep this in mind if your datasets are very large or you use many, many processes!
import multiprocess as mp # use with Jupyter Notebooks
#import multiprocessing as mp # use with traditional python files
def calc_coulomb_MP(n_proc, pts_1, rho_1, pts_2, rho_2, dV):
# outer loop will be split by each process
pts_1_split = np.array_split(pts_1, n_proc)
rho_1_split = np.array_split(rho_1, n_proc)
# inner loop will remain the same, so we simply copy the data
pts_2_copies = [pts_2]*n_proc
rho_2_copies = [rho_2]*n_proc
# a copy also needs to be supplied to each process
dV_list = [dV]*n_proc
with mp.Pool(n_proc) as pool:
func_params = zip(pts_1_split, rho_1_split, pts_2_copies, rho_2_copies, dV_list)
# results = pool.starmap(calc_coulomb_pure_python, func_params)
results = pool.starmap(calc_coulomb_numpy, func_params)
return np.sum(results)
start = time.time()
N_PROCESS = 4
# use the following lines instead if you want a shorter runtime
# total = calc_coulomb_MP(4, data_1_L.coords, data_1_L.cube_data, data_2_L.coords, data_2_L.cube_data, dV_12_L)
# total_time = (time.time() - start)*point_ratio
total = calc_coulomb_MP(N_PROCESS, data_1.coords, data_1.cube_data, data_2.coords, data_2.cube_data, dV_12)
total_time = (time.time() - start)
all_timers['multiprocess_numpy'] = total_time
print(f'multiprocess_numpy: {total_time:.2f} s ({total*AU_2_EV:.5f} eV)')
print(f'numpy: {all_timers["numpy"]:.2f} s ')
print(f'speed-up: {all_timers["numpy"]/all_timers["multiprocess_numpy"]:.1f}x')
Coulomb Integral 0.0 %
Coulomb Integral 0.0 %
Coulomb Integral 0.0 % Coulomb Integral 0.0 %
Coulomb Integral 20.0 %
Coulomb Integral 20.0 %
Coulomb Integral 20.0 %
Coulomb Integral 20.0 %
Coulomb Integral 40.0 %
Coulomb Integral 40.0 %
Coulomb Integral 40.0 %
Coulomb Integral 40.0 %
Coulomb Integral 60.0 %
Coulomb Integral 60.0 %
Coulomb Integral 60.0 %
Coulomb Integral 60.0 %
Coulomb Integral 80.0 %
Coulomb Integral 80.0 %
Coulomb Integral 80.0 %
Coulomb Integral 80.0 %
Coulomb Integral 100.0 %
Coulomb Integral 100.0 %
Coulomb Integral 100.0 %
multiprocess_numpy: 15.09 s (0.24229 eV)
numpy: 62.18 s
speed-up: 4.1x
- Not dependent on shared memory, so there are no race conditions as with threading.
- Each process needs a copy of the data it will work on.
- Starting up a process can take some time, so it shouldn't be done frequently.
If using multiprocessing on a Slurm-controlled cluster, submit your job with #SBATCH -c N (CPUs per task) and not #SBATCH -n N (number of tasks) to control the number of CPU cores to use.
- While multiprocessing starts multiple instances of Python, they all originate from a single parent process, so Slurm sees one task that uses many CPU cores.
Jupyter notebooks don't always play nicely with the multiprocessing library, and multiprocess, which is supposed to assist with these errors, isn't always perfect either. Multiprocessing is really designed to be used with .py files, so the implementation has also been placed in external/mp_external.py.
- Instead of running in a Jupyter cell, this version calls python again on mp_external.py.
- This is not typically done, but it will suffice to circumvent any Jupyter issues.
import subprocess
command = 'python external/mp_external.py'
# call python as an external program. This is VERY unorthodox and not usually recommended
subprocess.call(command.split())
total_time, total = np.loadtxt('_mp_external.txt')
all_timers['multiprocess_numpy'] = total_time
print(f'multiprocess_numpy: {total_time:.2f} s ({total*AU_2_EV:.5f} eV)')
print(f'numpy: {all_timers["numpy"]:.2f} s ')
print(f'speed-up: {all_timers["numpy"]/all_timers["multiprocess_numpy"]:.1f}x')
Reading cube file CV_data/transdens_1_low.cub
Formatting
Done
Reading cube file CV_data/transdens_2_low.cub
Formatting
Done
Reading cube file CV_data/transdens_1_extra_low.cub
Formatting
Done
Reading cube file CV_data/transdens_2_extra_low.cub
Formatting
Done
Number of points in regular cubes: (61617, 3) (66654, 3)
(61617,) (66654,)
Number of points in reduced cubes: (2975, 3) (5220, 3)
(2975,) (5220,)
Point ratio: 264.46566328600403
Coulomb Integral 0.0 %
Coulomb Integral 0.0 %
Coulomb Integral 0.0 %
Coulomb Integral 0.0 %
Coulomb Integral 20.0 %
Coulomb Integral 20.0 %
Coulomb Integral 20.0 %
Coulomb Integral 20.0 %
Coulomb Integral 40.0 %
Coulomb Integral 40.0 %
Coulomb Integral 40.0 %
Coulomb Integral 40.0 %
Coulomb Integral 60.0 %
Coulomb Integral 60.0 %
Coulomb Integral 60.0 %
Coulomb Integral 60.0 %
Coulomb Integral 80.0 %
Coulomb Integral 80.0 %
Coulomb Integral 80.0 %
Coulomb Integral 80.0 %
Coulomb Integral 100.0 %
Coulomb Integral 100.0 %
Coulomb Integral 100.0 %
multiprocess_numpy: 18.88 s (0.24228723350023043 eV)
Exiting external program
multiprocess_numpy: 18.88 s (0.24229 eV)
numpy: 62.18 s
speed-up: 3.3x
For functions that are mostly numerical (like our Coulomb integral), numba can be used to perform just-in-time compilation.
- As soon as the function is called, it is compiled into machine code and runs like any other self-contained python function.
- Because the compilation can take some time, benchmarks should be run after the function is called at least once.
import numba
# the numba decorator signals that the function should be compiled
# fastmath=True enables SIMD vectorization
@numba.jit(nopython=True, fastmath=True)
def calc_coulomb_numba(pts_1, rho_1, pts_2, rho_2, dV):
total = 0.0
n_pts_1 = len(pts_1)
n_pts_2 = len(pts_2)
for i in numba.prange(n_pts_1):
x1, y1, z1 = pts_1[i]
for j in range(n_pts_2):
x2, y2, z2 = pts_2[j]
dx = x1 - x2
dy = y1 - y2
dz = z1 - z2
r = sqrt(dx*dx + dy*dy + dz*dz)
total += rho_1[i]*rho_2[j]/r
return total*dV
start = time.time()
total = calc_coulomb_numba(data_1.coords, data_1.cube_data, data_2.coords, data_2.cube_data, dV_12)
total_time = (time.time() - start)
all_timers['numba'] = total_time
print(f'numba: {total_time:8.2f} s ({total*AU_2_EV:.5f} eV)')
print(f'pure_python: {all_timers["pure_python"]:8.2f} s')
print(f'speed-up: {all_timers["pure_python"]/all_timers["numba"]:8.1f}x')
numba: 5.87 s (0.24229 eV)
pure_python: 5057.54 s
speed-up: 861.1x
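Note that the timing above still includes the one-time compilation cost. A small sketch of how to benchmark only the numerical work (same function and data as above): call the function once on a tiny slice of the data to trigger compilation, then time a full run.
# warm-up call: triggers compilation on a tiny slice of the data
_ = calc_coulomb_numba(data_1.coords[:2], data_1.cube_data[:2],
                       data_2.coords[:2], data_2.cube_data[:2], dV_12)
# now time the already-compiled function on the full dataset
start = time.time()
total = calc_coulomb_numba(data_1.coords, data_1.cube_data,
                           data_2.coords, data_2.cube_data, dV_12)
print(f"numba, compilation excluded: {time.time() - start:.2f} s")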
Numba can also be used for multithreaded workloads. Simply add the parallel=True argument to the jit decorator and replace range with numba.prange. This last part is crucial, as it tells Numba where to parallelize the loop.
# notice the parallel=True argument
@numba.jit(nopython=True, fastmath=True, parallel=True)
def calc_coulomb_numba_parallel(pts_1, rho_1, pts_2, rho_2, dV):
total = 0.0
n_pts_1 = len(pts_1)
n_pts_2 = len(pts_2)
for i in numba.prange(n_pts_1): # prange tells numba where to parallelize the loop
for j in range(n_pts_2):
x1, y1, z1 = pts_1[i]
x2, y2, z2 = pts_2[j]
dx = x1 - x2
dy = y1 - y2
dz = z1 - z2
r = sqrt(dx**2 + dy**2 + dz**2)
total += rho_1[i]*rho_2[j]/r
return total*dV
start = time.time()
numba.set_num_threads(4)
total = calc_coulomb_numba_parallel(data_1.coords, data_1.cube_data, data_2.coords, data_2.cube_data, dV_12)
total_time = (time.time() - start)
all_timers['numba_parallel'] = total_time
print(f'numba_parallel: {total_time:.2f} s ({total*AU_2_EV:.5f} eV)')
print(f'numba: {all_timers["numba"]:.2f} s ({total*AU_2_EV:.5f} eV)')
print(f'speed-up: {all_timers["numba"]/all_timers["numba_parallel"]:8.1f}x')
numba_parallel: 1.57 s (0.24229 eV)
numba: 5.87 s (0.24229 eV)
speed-up: 3.7x
Multiple forms of optimization can also be combined!
For example, we can call the parallel Numba compiled code above with multiple threads, but also spread each set of threaded work over multiple processes.
- I'm using 4 threads and two processes, but this can be tweaked based on your available hardware.
def calc_coulomb_MP_numba(n_proc, pts_1, rho_1, pts_2, rho_2, dV):
# outer loop will be split by each process
pts_1_split = np.array_split(pts_1, n_proc)
rho_1_split = np.array_split(rho_1, n_proc)
# inner loop will remain the same, so we simply copy the data
pts_2_copies = [pts_2]*n_proc
rho_2_copies = [rho_2]*n_proc
# a copy also needs to be supplied to each process
dV_list = [dV]*n_proc
with mp.Pool(n_proc) as pool:
func_params = zip(pts_1_split, rho_1_split, pts_2_copies, rho_2_copies, dV_list)
# call the numba version with multiple threads
results = pool.starmap(calc_coulomb_numba_parallel, func_params)
return np.sum(results)
N_PROCESS = 2
N_THREADS = 4
start = time.time()
numba.set_num_threads(N_THREADS) # set the number of threads
total = calc_coulomb_MP_numba(N_PROCESS, data_1.coords, data_1.cube_data, data_2.coords, data_2.cube_data, dV_12)
total_time = (time.time() - start)
all_timers['multiproc + numba_parallel'] = total_time
print(f'multiproc + numba_parallel: {total_time:.2f} s ({total*AU_2_EV:.5f} eV)')
multiproc + numba_parallel: 0.83 s (0.24229 eV)
Again, due to Jupyter not cooperating with multiprocessing, we have an external version as well.
import subprocess
command = 'python external/combo_external.py'
# call python as an external program. This is VERY unorthodox and not usually recommended
subprocess.call(command.split())
total_time, total = np.loadtxt('_combo_external.txt')
all_timers['multiproc + numba_parallel'] = total_time
print(f'multiproc + numba_parallel: {total_time:.2f} s ({total*AU_2_EV:.5f} eV)')
Not all threaded applications will scale perfectly, especially when many cores are involved! Depending on the application, performance may plateau with a small number of cores.
In the above example, 7 or 8 threads seems to be the most that should be used to squeeze out extra threading performance. However, let's also say that you have a total of 32 cores to use! Instead of forcing the application to run on 32 threads, you could run 4 processes, each with 8 threads.
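For example, with 32 cores the split could look like the following sketch (reusing calc_coulomb_MP_numba from above; the exact split should be tuned to your own hardware):
N_PROCESS = 4   # four separate python processes
N_THREADS = 8   # eight numba threads inside each process
numba.set_num_threads(N_THREADS)
total = calc_coulomb_MP_numba(N_PROCESS, data_1.coords, data_1.cube_data,
                              data_2.coords, data_2.cube_data, dV_12)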
print(f'{"Method":28s} {"Time":>7s}')
print('--------------------------------------')
for name, val in all_timers.items():
print(f'{name:28s} {val:7.2f} s')
print('--------------------------------------')
Method Time
--------------------------------------
pure_python 5057.54 s
pure_python_threaded 5118.22 s
numpy_threaded 24.97 s
numpy 62.18 s
multiprocess_numpy 18.88 s
numba 5.87 s
numba_parallel 1.57 s
multiproc + numba_parallel 0.83 s
--------------------------------------