This tutorial goes over some popular techniques for getting the best performance out of your Python code, including Numpy, threading, multiprocessing, and just-in-time compilation. The emphasis is on CPU performance, using the running example of speeding up a numerical double integral.
- Requirements
- Example for this Tutorial: Coulomb Interaction on Excited States
- A First Attempt: Pure Python
- Numpy
- Multithreading
- Multiprocessing
- Numba
- Combination of Numba, Threading, and Multiprocessing
- Summary of Results
First, let's make sure we have the needed packages installed. To run the tutorial, you will need to have Python and Jupyter installed. Instructions for how to do so on all platforms, including Windows, MacOS, and Linux, can be found here.
In addition, the tutorial will leverage the following packages:
- conda install python=3.10
- pip install numpy
- pip install numba
- pip install multiprocess
Or, you can pip install them directly from within the notebook:
! pip install numba
! pip install multiprocess
! pip install numpy
Let's make sure everything can be imported. We'll also set some constants and import the other packages needed for this tutorial.
import numpy as np
import time
import numba
import multiprocess
from math import sqrt
AU_2_EV = 27.211396 # convert from atomic units to electron volts
The example that we will use is the Coulomb coupling between two excited state monomers. This coupling is used to estimate the collective excitation energies of the combined dimer system, while only computing the excitation energies and transition densities of each monomer separately.
For those interested in the topic, I suggest taking a look at "Charge and Energy Transfer Dynamics in Molecular Systems" by Prof. Dr. Oliver Kühn.
If you're not familiar with excitation energy calculations, that's OK! All you need to be concerned with is the idea of computing a numerical double integral, which we will use as the running example in this tutorial.
Consider two molecules, each with its own excitation energy.
When combined, the supramolecular system will have new excited states separated by a splitting $2\Delta$.
This splitting depends on the Coulomb coupling between the monomers' transition densities $\rho_{eg}$.
The coupling can be approximated by dividing each density onto a grid and summing over each "cube" density element.
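Written out, the coupling $\Delta$ between the two transition densities and the grid approximation implemented below are

$$\Delta = \int \int \frac{\rho_1^{eg}(\mathbf{r}_1)\,\rho_2^{eg}(\mathbf{r}_2)}{|\mathbf{r}_1 - \mathbf{r}_2|}\,d\mathbf{r}_1\,d\mathbf{r}_2 \approx \sum_{i}\sum_{j}\frac{\rho_1^{eg}(\mathbf{r}_i)\,\rho_2^{eg}(\mathbf{r}_j)}{|\mathbf{r}_i - \mathbf{r}_j|}\,\Delta V_1\,\Delta V_2 ,$$

where $\Delta V_1 \Delta V_2$ is the product of the two cube volume elements (the dV passed to the functions below).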
The following sections of this tutorial demonstrate various implementations of this summation, starting with the slowest and ending with the most efficient methods.
First we load in two transition density cube files for a cresyl-violet dimer.
- Their geometries were oriented in a stacked configuration and optimized at the CAM-B3LYP/6-31G* level of theory with Gaussian 16.
- MultiWfn was used to generate the cube densities from a Gaussian formatted checkpoint file.
Because some of these examples will take some time to run, we will use two versions of the data: one with the "regular" number of cube points and another with a reduced number of points.
Since the nested for loops scale with the product of the number of points in the two cubes, we can estimate the full-dataset runtime by multiplying the reduced-dataset runtime by the ratio of these point-count products.
from external import Cube # custom gaussian cube file reader
from os.path import join
data_1 = Cube.CubeData(join('CV_data', 'transdens_1_low.cub'))
data_2 = Cube.CubeData(join('CV_data', 'transdens_2_low.cub'))
dV_12 = data_1.dV * data_2.dV
data_1_L = Cube.CubeData(join('CV_data', 'transdens_1_extra_low.cub'))
data_2_L = Cube.CubeData(join('CV_data', 'transdens_2_extra_low.cub'))
dV_12_L = data_1_L.dV * data_2_L.dV
print("Dimension of coordinates in regular cubes: ", data_1.coords.shape, data_2.coords.shape)
print("Dimension of density data in regular cubes: ", data_1.cube_data.shape, data_2.cube_data.shape)
print("Dimension of coordinates in reduced cubes: ", data_1_L.coords.shape, data_2_L.coords.shape)
print("Dimension of density data in reduced cubes: ", data_1_L.cube_data.shape, data_2_L.cube_data.shape)
# This ratio will be used as a means of estimating the time it takes to
# run a portion of the code with the larger data set
point_ratio = data_1.n_points * data_2.n_points/(data_1_L.n_points * data_2_L.n_points)
print(f"Ratio of the two datasets: {point_ratio:.2f}")
# will store all of our benchmarks as we proceed through the tutorial
all_timers = {}
Reading cube file CV_data/transdens_1_low.cub
Formatting
Done
Reading cube file CV_data/transdens_2_low.cub
Formatting
Done
Reading cube file CV_data/transdens_1_extra_low.cub
Formatting
Done
Reading cube file CV_data/transdens_2_extra_low.cub
Formatting
Done
Dimension of coordinates in regular cubes: (61617, 3) (66654, 3)
Dimension of density data in regular cubes: (61617,) (66654,)
Dimension of coordinates in reduced cubes: (2975, 3) (5220, 3)
Dimension of density data in reduced cubes: (2975,) (5220,)
Ratio of the two datasets: 264.47
The double summation is implemented below in what would be considered the most straightforward way possible:
- Construct two for loops, one inner and one outer, one for each cube density.
- Sum over the product of each pair of cube values divided by the distance between them. To save time, we will use the reduced cube files and estimate the full-dataset runtime with the point ratio computed above.
from math import sqrt
import time
def calc_coulomb_pure_python(pts_1, rho_1, pts_2, rho_2, dV):
'''
Parameters
----------
pts_1: (Nx3) list or array of cube coordinates
rho_1: (Nx1) list or array of cube values
pts_2: (Mx3) list or array of cube coordinates
rho_2: (Mx1) list or array of cube values
dV: product of the two cube volume elements
'''
from math import sqrt # normally this is imported beforehand, but re-importing here works around Jupyter multiprocessing quirks (discussed later)
total = 0.0
n_pts_1 = len(pts_1)
n_pts_2 = len(pts_2)
print_num = n_pts_1//5
for i in range(n_pts_1):
for j in range(n_pts_2):
# grab the coordinates and take their difference
x1, y1, z1 = pts_1[i]
x2, y2, z2 = pts_2[j]
dx = x1 - x2
dy = y1 - y2
dz = z1 - z2
# distance between the two points
r = sqrt(dx**2 + dy**2 + dz**2)
# update the summation total
total += rho_1[i]*rho_2[j]/r
# print our progress
if i % print_num == 0 or i == n_pts_1 - 1:
print(f"Coulomb integral progress: {(i / n_pts_1*100):5.1f} %")
return total*dV
# Record the current time, run the code, and record the current time again.
# The difference in the two times measures the execution time.
start = time.time()
total = calc_coulomb_pure_python(data_1_L.coords, data_1_L.cube_data, data_2_L.coords, data_2_L.cube_data, dV_12_L)
total_time = (time.time() - start)*point_ratio
all_timers['pure_python'] = total_time
print(f'pure_python*: {total_time:.2f} s ({total*AU_2_EV:.5} eV)')
Coulomb integral progress: 0.0 %
Coulomb integral progress: 20.0 %
Coulomb integral progress: 40.0 %
Coulomb integral progress: 60.0 %
Coulomb integral progress: 80.0 %
Coulomb integral progress: 100.0 %
pure_python*: 5057.54 s (0.23374 eV)
Our first speed improvement will use the Numpy library.
In short, Numpy gives us a way to manipulate multiple data points, stored in arrays, at the same time.
- Think of numpy arrays as mathematical vectors and matrices.
import numpy as np
a = np.array([1, 2, 3, 4, 5])
b = np.array([2, 3, 4, 5, 6])
c = np.array([[1, 2], [3, 4]])
d = np.array([[2, 0], [0, 3]])
# scalar addition to a vector
print(f"{ a + 2 = }")
# built in functions
print(f"{ np.sum(a) = }")
# element-wise multiplication
print(f"{ a * b = }")
# matrix-vector multiplication
print(f"{ c @ d = }")
a + 2 = array([3, 4, 5, 6, 7])
np.sum(a) = 15
a * b = array([ 2, 6, 12, 20, 30])
c @ d = array([[ 2, 6],
[ 6, 12]])
When you run a line of Python, you are actually running a bunch of compiled C code in the background.
source: https://pythonextensionpatterns.readthedocs.io/en/latest/refcount.html
Each line must go through the Python interpreter to execute the correct PyObject and its respective code.
Numpy functions are also implemented as compiled C-code, but each iteration within a function (np.sum, np.exp, etc.) is all done internally. This means the interpreter is only needed once for each numpy call!
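As a small illustration of the difference (the exact numbers will depend on your machine), compare summing ten million values with a pure-Python loop versus a single np.sum call:
import time
import numpy as np
big_array = np.random.rand(10_000_000)  # ten million random values
# pure-python loop: the interpreter is involved on every iteration
start = time.time()
loop_total = 0.0
for value in big_array:
    loop_total += value
print(f"python loop: {time.time() - start:.3f} s")
# one numpy call: the loop runs entirely inside compiled C code
start = time.time()
numpy_total = np.sum(big_array)
print(f"np.sum:      {time.time() - start:.3f} s")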
The next implementation replaces the inner loop with pure Numpy functions. Note the huge speed up!
def calc_coulomb_numpy(pts_1, rho_1, pts_2, rho_2, dV):
total = 0.0
n_pts_1 = len(pts_1)
print_num = n_pts_1//5
for i in range(n_pts_1):
if i % print_num == 0:
print(f" Coulomb Integral {(i / n_pts_1*100):.1f} %")
dr = pts_1[i] - pts_2
r = np.linalg.norm(dr, axis=1)
total += rho_1[i]*np.sum(rho_2/r)
return total*dV
start = time.time()
total = calc_coulomb_numpy(data_1_L.coords, data_1_L.cube_data, data_2_L.coords, data_2_L.cube_data, dV_12_L)
total_time = (time.time() - start)*point_ratio
# depending on the estimated time with the reduced dataset, you may be able to uncomment
# the next two lines and use the entire dataset if you are willing to wait.
# total = calc_coulomb_numpy(data_1.coords, data_1.cube_data, data_2.coords, data_2.cube_data, dV_12)
# total_time = (time.time() - start)
all_timers['numpy'] = total_time
print(f'numpy: {total_time:.2f} s ({total*AU_2_EV:.5f} eV)')
Coulomb Integral 0.0 %
Coulomb Integral 20.0 %
Coulomb Integral 40.0 %
Coulomb Integral 60.0 %
Coulomb Integral 80.0 %
numpy: 62.18 s (0.23374 eV)
Modern CPUs contain multiple CPU cores, each acting as an individual CPU but all connected to the same internal memory (CPU cache or RAM). Each core can perform (roughly) the same amount of work as any other core. Unless your code is told to do otherwise, Python will run on only one core at a time.
We tell Python how to use these cores by setting up software "threads". Each thread branches off of the main Python program, does a separate share of the work, and can be assigned to its own CPU core.
- Technically, the number of threads is not limited by the number of cores, but in HPC it's rarely beneficial to assign more threads than cores; with some software, this can actually hurt your performance! (A quick way to check your core count is sketched below.)
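To see how many cores you have to work with, a quick check (note that os.cpu_count() reports logical cores, which may include hyper-threads):
import os
print(f"Logical CPU cores available: {os.cpu_count()}")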
We can split up our integral work by partitioning the outer loop (over molecule 1) into separate contributions of the total density.
- Each sub-density will be assigned to a separate thread and will only affect the number of iterations in the outer for-loop.
- The inner for-loop will remain the same.
import threading
# Shared memory that each thread has access to
thread_totals = np.array([])
def _coulomb_by_indix(indicies, pts_1, rho_1, pts_2, rho_2, dV, thread_ID):
total = 0.0
n_pts_1 = len(indicies)
n_pts_2 = len(pts_2)
print_num = n_pts_1//5
for count, i in enumerate(indicies): # Now we loop over specified indicies only
for j in range(n_pts_2):
x1, y1, z1 = pts_1[i]
x2, y2, z2 = pts_2[j]
dx = x1 - x2
dy = y1 - y2
dz = z1 - z2
r = sqrt(dx**2 + dy**2 + dz**2)
total += rho_1[i]*rho_2[j]/r
if count % print_num == 0:
# Also print thread ID with the progress
print(f" Coulomb Integral {thread_ID} {(count / n_pts_1*100):.1f} %")
# update the global integral totals
thread_totals[thread_ID] = total*dV
def calc_coulomb_thread(n_threads, pts_1, rho_1, pts_2, rho_2, dV):
global thread_totals
all_threads = []
# initialize each thread's total to zero
thread_totals = np.zeros(n_threads)
for n in range(n_threads):
# these will be the indicies used by the inner Coulomb loop
indicies = np.arange(n, len(pts_1), n_threads)
print(f"Thread {n} using indicies ", *indicies[0:4], "...")
thread = threading.Thread(
target=_coulomb_by_indix,
args=(indicies, pts_1, rho_1, pts_2, rho_2, dV, n)
)
all_threads.append(thread)
# the function is not called until we start the thread
thread.start()
# now wait for all the threads to complete
for thread in all_threads:
thread.join()
# The total Coulomb integral is equal to the sum of each thread's partial integral
return np.sum(thread_totals)
start = time.time()
total = calc_coulomb_thread(4, data_1_L.coords, data_1_L.cube_data, data_2_L.coords, data_2_L.cube_data, dV_12_L)
# total = calc_coulomb_thread(4, data_1.coords, data_1.cube_data, data_2.coords, data_2.cube_data, dV_12)
total_time = (time.time() - start)*point_ratio
all_timers['pure_python_threaded'] = total_time
print(f'pure_python_threaded: {total_time:.2f} s ({total*AU_2_EV:.5f} eV)')
print(f'pure_python: {all_timers["pure_python"]:.2f} s ')
Thread 0 using indicies 0 4 8 12 ...
Coulomb Integral 0 0.0 %
Thread 1 using indicies 1 5 9 13 ...
Thread 2 using indicies 2 6 10 14 ...
Coulomb Integral 1 0.0 %
Coulomb Integral 2 0.0 %
Thread 3 using indicies 3 7 11 15 ...
Coulomb Integral 3 0.0 %
Coulomb Integral 1 19.9 %
Coulomb Integral 0 19.9 %
Coulomb Integral 3 19.9 %
Coulomb Integral 2 19.9 %
Coulomb Integral 3 39.8 % Coulomb Integral 1 39.8 %
Coulomb Integral 0 39.8 %
Coulomb Integral 2 39.8 %
Coulomb Integral 3 59.8 %
Coulomb Integral 1 59.7 %
Coulomb Integral 0 59.7 %
Coulomb Integral 2 59.7 %
Coulomb Integral 3 79.7 %
Coulomb Integral 1 79.6 %
Coulomb Integral 0 79.6 %
Coulomb Integral 2 79.6 %
Coulomb Integral 1 99.5 %
Coulomb Integral 3 99.6 %
Coulomb Integral 0 99.5 %
Coulomb Integral 2 99.5 %
pure_python_threaded: 5118.22 s (0.23374 eV)
pure_python: 5057.54 s
You should have noticed that the above code didn't run any faster than the original pure-python function defined above. This is because every line in the function must acquire the Global Interpreter Lock (GIL) before it can be executed. The GIL ensures that only one thread can run Python code, and thus touch Python objects, at a time. When two threads attempt to modify the same memory at the same time, we call that a race condition, which can lead to unpredictable results.
# obtain GIL
a = 5
# release GIL
# obtain GIL
print(a)
# release GIL
5
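Here is a minimal sketch of the consequence (timings are illustrative and machine-dependent): a CPU-bound, pure-Python countdown takes roughly as long split across two threads as it does on one thread, because only one thread can hold the GIL at any moment.
import threading
import time
def count_down(n):
    while n > 0:
        n -= 1
N = 10_000_000
# one thread doing all of the work
start = time.time()
count_down(N)
print(f"single thread: {time.time() - start:.2f} s")
# two threads splitting the work: not faster, because of the GIL
start = time.time()
t1 = threading.Thread(target=count_down, args=(N // 2,))
t2 = threading.Thread(target=count_down, args=(N // 2,))
t1.start()
t2.start()
t1.join()
t2.join()
print(f"two threads:   {time.time() - start:.2f} s")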
Instead, we can replace the inner for-loop with Numpy routines, which release the GIL while they execute. This allows another Python thread to run while Numpy is handling the numerical heavy lifting.
import threading
thread_totals = np.array([])
def _coulomb_by_indix_2(indicies, pts_1, rho_1, pts_2, rho_2, dV, thread_ID):
total = 0.0
n_pts_1 = len(indicies)
print_num = n_pts_1//5
for count, i in enumerate(indicies): # EDIT: loop over specified indicies only
if count % print_num == 0:
print(f" Coulomb Integral {thread_ID} {(count / n_pts_1*100):.1f} %")
dr = pts_2 - pts_1[i]
r = np.linalg.norm(dr, axis=1)
total += rho_1[i]*np.sum(rho_2/r)
thread_totals[thread_ID] = total*dV
def calc_coulomb_thread(n_threads, pts_1, rho_1, pts_2, rho_2, dV):
global thread_totals
all_threads = []
thread_totals = np.zeros(n_threads)
for n in range(n_threads):
# these will be the indicies used by the inner Coulomb loop
indicies = np.arange(n, len(pts_1), n_threads)
print(f"Thread {n} using indicies ", *indicies[0:4], "...")
thread = threading.Thread(
target=_coulomb_by_indix_2,
args=(indicies, pts_1, rho_1, pts_2, rho_2, dV, n)
)
all_threads.append(thread)
# the function is not called until we start the thread
thread.start()
# now wait for all the threads to complete
for thread in all_threads:
thread.join()
# The total Coulomb integral is equal to the sum of each thread's partial integral
return np.sum(thread_totals)
start = time.time()
total = calc_coulomb_thread(4, data_1.coords, data_1.cube_data, data_2.coords, data_2.cube_data, dV_12)
total_time = (time.time() - start)
all_timers['numpy_threaded'] = total_time
print(f'numpy_threaded: {total_time:.2f} s ({total*AU_2_EV:.5f} eV)')
print(f'numpy: {all_timers["numpy"]:.2f} s ')
print(f'speed-up: {all_timers["numpy"]/all_timers["numpy_threaded"]:.1f}x')
Thread 0 using indicies 0 4 8 12 ...
Coulomb Integral 0 0.0 %
Thread 1 using indicies 1 5 9 13 ...
Coulomb Integral 1 0.0 %
Thread 2 using indicies 2 6 10 14 ...
Coulomb Integral 2 0.0 %
Thread 3 using indicies 3 7 11 15 ...
Coulomb Integral 3 0.0 %
Coulomb Integral 0 20.0 %
Coulomb Integral 2 20.0 %
Coulomb Integral 3 20.0 %
Coulomb Integral 1 20.0 %
Coulomb Integral 0 40.0 %
Coulomb Integral 3 40.0 %
Coulomb Integral 1 40.0 %
Coulomb Integral 2 40.0 %
Coulomb Integral 3 60.0 %
Coulomb Integral 0 60.0 %
Coulomb Integral 1 60.0 %
Coulomb Integral 2 60.0 %
Coulomb Integral 3 80.0 %
Coulomb Integral 1 80.0 %
Coulomb Integral 0 80.0 %
Coulomb Integral 2 80.0 %
Coulomb Integral 3 100.0 %
Coulomb Integral 1 100.0 %
Coulomb Integral 2 100.0 %
numpy_threaded: 24.97 s (0.24229 eV)
numpy: 62.18 s
speed-up: 2.5x
Similar to threading, multiprocessing also partitions the work onto multiple computing kernels, termed processes.
The main difference here is that each process is a separate Python instance. As the processes are (mostly) independent from one another, each needs its own copy of the data being summed over.
- There are ways to share data between each process using Queues, but this will not be covered here.
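As a minimal sketch of the Pool.starmap pattern used below (the worker function here is just a toy; the real Coulomb functions are reused later), each tuple of arguments is shipped to one of the worker processes and the results come back in order:
import multiprocess as mp
def weighted_distance(x, y, weight):
    # toy worker: each call is executed inside one of the pool's processes
    return weight * (x**2 + y**2) ** 0.5
args = [(3.0, 4.0, 0.5), (6.0, 8.0, 0.1), (5.0, 12.0, 1.0)]
with mp.Pool(2) as pool:
    results = pool.starmap(weighted_distance, args)
print(results)  # [2.5, 1.0, 13.0]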
You can observe each Python instance running by monitoring your system with the top bash command. As the cell runs, you should see one python process per pool worker.
- Notice that each process uses about the same amount of RAM. Keep this in mind if your datasets are very large or you use many, many processes!
import multiprocess as mp # use with Jupyter Notebooks
#import multiprocessing as mp # use with traditional python files
def calc_coulomb_MP(n_proc, pts_1, rho_1, pts_2, rho_2, dV):
# outer loop will be split by each process
pts_1_split = np.array_split(pts_1, n_proc)
rho_1_split = np.array_split(rho_1, n_proc)
# inner loop will remain the same, so we simply copy the data
pts_2_copies = [pts_2]*n_proc
rho_2_copies = [rho_2]*n_proc
# a copy also needs to be supplied to each process
dV_list = [dV]*n_proc
with mp.Pool(n_proc) as pool:
func_params = zip(pts_1_split, rho_1_split, pts_2_copies, rho_2_copies, dV_list)
# results = pool.starmap(calc_coulomb_pure_python, func_params)
results = pool.starmap(calc_coulomb_numpy, func_params)
return np.sum(results)
start = time.time()
N_PROCESS = 4
# use the following lines instead if you want a shorter runtime
# total = calc_coulomb_MP(4, data_1_L.coords, data_1_L.cube_data, data_2_L.coords, data_2_L.cube_data, dV_12_L)
# total_time = (time.time() - start)*point_ratio
total = calc_coulomb_MP(N_PROCESS, data_1.coords, data_1.cube_data, data_2.coords, data_2.cube_data, dV_12)
total_time = (time.time() - start)
all_timers['multiprocess_numpy'] = total_time
print(f'multiprocess_numpy: {total_time:.2f} s ({total*AU_2_EV:.5f} eV)')
print(f'numpy: {all_timers["numpy"]:.2f} s ')
print(f'speed-up: {all_timers["numpy"]/all_timers["multiprocess_numpy"]:.1f}x')
Coulomb Integral 0.0 %
Coulomb Integral 0.0 %
Coulomb Integral 0.0 % Coulomb Integral 0.0 %
Coulomb Integral 20.0 %
Coulomb Integral 20.0 %
Coulomb Integral 20.0 %
Coulomb Integral 20.0 %
Coulomb Integral 40.0 %
Coulomb Integral 40.0 %
Coulomb Integral 40.0 %
Coulomb Integral 40.0 %
Coulomb Integral 60.0 %
Coulomb Integral 60.0 %
Coulomb Integral 60.0 %
Coulomb Integral 60.0 %
Coulomb Integral 80.0 %
Coulomb Integral 80.0 %
Coulomb Integral 80.0 %
Coulomb Integral 80.0 %
Coulomb Integral 100.0 %
Coulomb Integral 100.0 %
Coulomb Integral 100.0 %
multiprocess_numpy: 15.09 s (0.24229 eV)
numpy: 62.18 s
speed-up: 4.1x
- Not dependent on shared memory, so there are no race conditions as with threading.
- Each process needs a copy of the data it will work on.
- Starting up a process can take some time, so it shouldn't be done frequently.
If using multiprocessing on a Slurm-controlled cluster, submit your job with #SBATCH -c N (CPUs per task) and not #SBATCH -n N (number of tasks) to control the number of CPU cores to use.
- While multiprocessing starts multiple instances of Python, they all originate from a single parent process, so Slurm sees one task that uses many CPU cores.
Jupyter notebooks don't always play nicely with the multiprocessing library, and multiprocess, which is supposed to assist with these errors, isn't always perfect either. Multiprocessing is really designed to be used with .py files, so the implementation has also been placed in external/mp_external.py.
- Instead of running in a Jupyter cell, this version calls python again on mp_external.py.
- This is not typically done, but it will suffice to circumvent any Jupyter issues.
import subprocess
command = 'python external/mp_external.py'
# call python as an external program. This is VERY unorthodox and not usually recommended
subprocess.call(command.split())
total_time, total = np.loadtxt('_mp_external.txt')
all_timers['multiprocess_numpy'] = total_time
print(f'multiprocess_numpy: {total_time:.2f} s ({total*AU_2_EV:.5f} eV)')
print(f'numpy: {all_timers["numpy"]:.2f} s ')
print(f'speed-up: {all_timers["numpy"]/all_timers["multiprocess_numpy"]:.1f}x')
Reading cube file CV_data/transdens_1_low.cub
Formatting
Done
Reading cube file CV_data/transdens_2_low.cub
Formatting
Done
Reading cube file CV_data/transdens_1_extra_low.cub
Formatting
Done
Reading cube file CV_data/transdens_2_extra_low.cub
Formatting
Done
Number of points in regular cubes: (61617, 3) (66654, 3)
(61617,) (66654,)
Number of points in reduced cubes: (2975, 3) (5220, 3)
(2975,) (5220,)
Point ratio: 264.46566328600403
Coulomb Integral 0.0 %
Coulomb Integral 0.0 %
Coulomb Integral 0.0 %
Coulomb Integral 0.0 %
Coulomb Integral 20.0 %
Coulomb Integral 20.0 %
Coulomb Integral 20.0 %
Coulomb Integral 20.0 %
Coulomb Integral 40.0 %
Coulomb Integral 40.0 %
Coulomb Integral 40.0 %
Coulomb Integral 40.0 %
Coulomb Integral 60.0 %
Coulomb Integral 60.0 %
Coulomb Integral 60.0 %
Coulomb Integral 60.0 %
Coulomb Integral 80.0 %
Coulomb Integral 80.0 %
Coulomb Integral 80.0 %
Coulomb Integral 80.0 %
Coulomb Integral 100.0 %
Coulomb Integral 100.0 %
Coulomb Integral 100.0 %
multiprocess_numpy: 18.88 s (0.24228723350023043 eV)
Exiting external program
multiprocess_numpy: 18.88 s (0.24229 eV)
numpy: 62.18 s
speed-up: 3.3x
For functions that are mostly numerical (like our Coulomb integral), numba can be used to perform just-in-time compilation.
- As soon as the function is called, it is compiled into machine code and runs like any other self-contained python function.
- Because the compilation can take some time, benchmarks should be run after the function is called at least once.
import numba
# the numba decorator signals that the function should be compiled
# fastmath=True enables SIMD vectorization
@numba.jit(nopython=True, fastmath=True)
def calc_coulomb_numba(pts_1, rho_1, pts_2, rho_2, dV):
total = 0.0
n_pts_1 = len(pts_1)
n_pts_2 = len(pts_2)
for i in numba.prange(n_pts_1):
x1, y1, z1 = pts_1[i]
for j in range(n_pts_2):
x2, y2, z2 = pts_2[j]
dx = x1 - x2
dy = y1 - y2
dz = z1 - z2
r = sqrt(dx*dx + dy*dy + dz*dz)
total += rho_1[i]*rho_2[j]/r
return total*dV
start = time.time()
total = calc_coulomb_numba(data_1.coords, data_1.cube_data, data_2.coords, data_2.cube_data, dV_12)
total_time = (time.time() - start)
all_timers['numba'] = total_time
print(f'numba: {total_time:8.2f} s ({total*AU_2_EV:.5f} eV)')
print(f'pure_python: {all_timers["pure_python"]:8.2f} s')
print(f'speed-up: {all_timers["pure_python"]/all_timers["numba"]:8.1f}x')
numba: 5.87 s (0.24229 eV)
pure_python: 5057.54 s
speed-up: 861.1x
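Note that the timing above still includes the one-time compilation cost. A small sketch of how to benchmark only the numerical work (same function and data as above): call the function once on a tiny slice of the data to trigger compilation, then time a full run.
# warm-up call: triggers compilation on a tiny slice of the data
_ = calc_coulomb_numba(data_1.coords[:2], data_1.cube_data[:2],
                       data_2.coords[:2], data_2.cube_data[:2], dV_12)
# now time the already-compiled function on the full dataset
start = time.time()
total = calc_coulomb_numba(data_1.coords, data_1.cube_data,
                           data_2.coords, data_2.cube_data, dV_12)
print(f"numba, compilation excluded: {time.time() - start:.2f} s")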
Numba can also be used for multithreaded workloads. Simply add the parallel=True argument to the jit decorator and replace range with numba.prange. This last part is crucial, as it tells Numba where to parallelize the loop.
# notice the parallel=True argument
@numba.jit(nopython=True, fastmath=True, parallel=True)
def calc_coulomb_numba_parallel(pts_1, rho_1, pts_2, rho_2, dV):
total = 0.0
n_pts_1 = len(pts_1)
n_pts_2 = len(pts_2)
for i in numba.prange(n_pts_1): # prange tells numba where to parallelize the loop
for j in range(n_pts_2):
x1, y1, z1 = pts_1[i]
x2, y2, z2 = pts_2[j]
dx = x1 - x2
dy = y1 - y2
dz = z1 - z2
r = sqrt(dx**2 + dy**2 + dz**2)
total += rho_1[i]*rho_2[j]/r
return total*dV
start = time.time()
numba.set_num_threads(4)
total = calc_coulomb_numba_parallel(data_1.coords, data_1.cube_data, data_2.coords, data_2.cube_data, dV_12)
total_time = (time.time() - start)
all_timers['numba_parallel'] = total_time
print(f'numba_parallel: {total_time:.2f} s ({total*AU_2_EV:.5f} eV)')
print(f'numba: {all_timers["numba"]:.2f} s ({total*AU_2_EV:.5f} eV)')
print(f'speed-up: {all_timers["numba"]/all_timers["numba_parallel"]:8.1f}x')
numba_parallel: 1.57 s (0.24229 eV)
numba: 5.87 s (0.24229 eV)
speed-up: 3.7x
Multiple forms of optimization can also be combined!
For example, we can call the parallel Numba compiled code above with multiple threads, but also spread each set of threaded work over multiple processes.
- I'm using 4 threads and two processes, but this can be tweaked based on your available hardware.
def calc_coulomb_MP_numba(n_proc, pts_1, rho_1, pts_2, rho_2, dV):
# outer loop will be split by each process
pts_1_split = np.array_split(pts_1, n_proc)
rho_1_split = np.array_split(rho_1, n_proc)
# inner loop will remain the same, so we simply copy the data
pts_2_copies = [pts_2]*n_proc
rho_2_copies = [rho_2]*n_proc
# a copy also needs to be supplied to each process
dV_list = [dV]*n_proc
with mp.Pool(n_proc) as pool:
func_params = zip(pts_1_split, rho_1_split, pts_2_copies, rho_2_copies, dV_list)
# call the numba version with multiple threads
results = pool.starmap(calc_coulomb_numba_parallel, func_params)
return np.sum(results)
N_PROCESS = 2
N_THREADS = 4
start = time.time()
numba.set_num_threads(N_THREADS) # set the number of threads
total = calc_coulomb_MP_numba(N_PROCESS, data_1.coords, data_1.cube_data, data_2.coords, data_2.cube_data, dV_12)
total_time = (time.time() - start)
all_timers['multiproc + numba_parallel'] = total_time
print(f'multiproc + numba_parallel: {total_time:.2f} s ({total*AU_2_EV:.5f} eV)')
multiproc + numba_parallel: 0.83 s (0.24229 eV)
Again, due to Jupyter not cooperating with multiprocessing, we have an external version as well.
import subprocess
command = 'python external/combo_external.py'
# call python as an external program. This is VERY unorthodox and not usually recommended
subprocess.call(command.split())
total_time, total = np.loadtxt('_combo_external.txt')
all_timers['multiproc + numba_parallel'] = total_time
print(f'multiproc + numba_parallel: {total_time:.2f} s ({total*AU_2_EV:.5f} eV)')
Not all threaded applications will scale perfectly, especially when many cores are involved! Depending on the application, performance may plateau with a small number of cores.
In the above example, 7 or 8 threads seems to be the most that should be used to squeeze out extra threading performance. However, let's also say that you have a total of 32 cores to use! Instead of forcing the application to run on 32 threads, you could run 4 processes, each with 8 threads.
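For example, with 32 cores the split could look like the following sketch (reusing calc_coulomb_MP_numba from above; the exact split should be tuned to your own hardware):
N_PROCESS = 4   # four separate python processes
N_THREADS = 8   # eight numba threads inside each process
numba.set_num_threads(N_THREADS)
total = calc_coulomb_MP_numba(N_PROCESS, data_1.coords, data_1.cube_data,
                              data_2.coords, data_2.cube_data, dV_12)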
print(f'{"Method":28s} {"Time":>7s}')
print('--------------------------------------')
for name, val in all_timers.items():
print(f'{name:28s} {val:7.2f} s')
print('--------------------------------------')
Method Time
--------------------------------------
pure_python 5057.54 s
pure_python_threaded 5118.22 s
numpy_threaded 24.97 s
numpy 62.18 s
multiprocess_numpy 18.88 s
numba 5.87 s
numba_parallel 1.57 s
multiproc + numba_parallel 0.83 s
--------------------------------------