
True parallelism in pure-Python code

Accompanying article: rishiraj.me/articles/2024-04/python_subinterpreter_parallelism

This is a proof-of-concept implementation of running "pure" Python code truly in parallel within a single CPython process. This is achieved using the new per-interpreter GIL construct from Python 3.12.

Here is a rough summary of what I have achieved:

  1. Create a CPython extension function subinterpreter_parallelism.parallel which allows users to call arbitrary Python functions from multiple threads, each of which starts its own sub-interpreter with its own GIL.
  2. This function takes in a variable number of lists, where each list consists of ["module_name", "function_name", args].
  3. Internally, we iterate over these lists and spawn a new thread, with its own interpreter, for each request, so the requests run in parallel.
  4. Since we can't easily share objects between these interpreters, I've opted to take the module name & function name as strings from the user and pass them as arguments to the thread. For the Python function's arguments, I pickle the args object into a std::string using the pickle module.
  5. Each spawned thread then creates a new interpreter with its own GIL & runs the function.
  6. Once the function completes execution in its thread, the result is handed back to the Python function via a promise.
  7. The Python function then unpickles the results as the futures resolve, and adds them to a list.
  8. This list is returned to the user once all the function calls have completed. (A rough pure-Python analogue of this flow is sketched below.)
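
For intuition, here is a rough pure-Python analogue of that flow, using the private _xxsubinterpreters module that ships with CPython 3.12. This is only an illustrative sketch, not the repo's implementation: the real extension does this in C++, returns results via promises/futures rather than printing, and the stdlib target modules here are assumed to support per-interpreter GILs.

import pickle
import threading
import _xxsubinterpreters as interpreters  # private CPython 3.12 module

def run_in_subinterpreter(module_name, function_name, args):
    # Pickle the args so they can cross the interpreter boundary as plain bytes.
    payload = pickle.dumps(args).hex()
    interp = interpreters.create()
    try:
        # Each sub-interpreter has its own GIL, so scripts running in
        # different interpreters execute truly in parallel.
        interpreters.run_string(interp, f"""
import importlib, pickle
mod = importlib.import_module({module_name!r})
result = getattr(mod, {function_name!r})(*pickle.loads(bytes.fromhex({payload!r})))
print({module_name!r}, {function_name!r}, '->', result)
""")
    finally:
        interpreters.destroy(interp)

threads = [threading.Thread(target=run_in_subinterpreter, args=spec)
           for spec in (('math', 'factorial', (20,)), ('math', 'gcd', (12, 18)))]
for t in threads:
    t.start()
for t in threads:
    t.join()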

Caveats

  1. This might not work well, or at all, with many other commonly used Python libraries (e.g. NumPy) in their current state. By default, all C/C++ extension modules are initialized without support for multiple interpreters; as of April 2024, this holds for all modules created using Cythonize (like NumPy). C extension libraries regularly interact with low-level APIs (like PyGIL_*) which are known to not work well with multiple sub-interpreters. Refer to the caveats section of the Python documentation. Hopefully, more libraries add support for this paradigm as it gains adoption.
  2. Performance should still be much better with pure C++ code for highly CPU-bound tasks, due to the overhead associated with Python being an interpreted language.
  3. Since very little is shared between interpreters in my setup, things like logging configuration, imports, etc. need to be explicitly provided in the functions being run in parallel (see the sketch below).
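
For example, a hypothetical worker module (the module and function names here are illustrative, not files shipped in this repo) would set up its own imports and logging, rather than relying on configuration done in the main interpreter:

# worker.py -- hypothetical module run via parallel()
def compute(n):
    # State configured in the main interpreter does not carry over, so the
    # function sets up logging (and any other configuration it needs) itself.
    import logging
    import math
    logging.basicConfig(level=logging.INFO)
    logging.info("computing factorial(%d)", n)
    return math.factorial(n)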

Note that this is just an experimental project done over a weekend that might be of interest to others curious about parallelism & Python's evolution.

Installation

Please use Python 3.12 or above for testing this out. The commands given here are for Linux & might require tweaking on other operating systems.

  1. Create & activate a virtual environment: python3.12 -m venv .venv && source .venv/bin/activate

  2. Ensure the benchmarking directory from this folder is treated as a Python source directory: export PYTHONPATH=$PYTHONPATH:`pwd`/benchmarking

  3. Ensure that setuptools is available locally: pip install setuptools

  4. Set up the Python extension locally: python3 setup.py install

  5. Run the demo code to validate that things are working fine: python3 demo.py

Usage

from subinterpreter_parallelism import parallel

# Run 3 threads of pure-Python functions in parallel using sub-interpreters.
result = parallel(['module1', 'func1', (arg11, arg12, arg13)],
                  ['module2', 'func2', (arg21, arg22)],
                  ['module3', 'func3', tuple()])
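
As a concrete end-to-end example (the module fact and function slow_factorial are hypothetical stand-ins, not files shipped in this repo), put a pure-Python worker on PYTHONPATH and call it three times in parallel:

# fact.py -- hypothetical pure-Python worker on PYTHONPATH
def slow_factorial(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

# main.py
from subinterpreter_parallelism import parallel

# Three independent sub-interpreters, each computing a factorial in parallel.
results = parallel(['fact', 'slow_factorial', (50_000,)],
                   ['fact', 'slow_factorial', (50_000,)],
                   ['fact', 'slow_factorial', (50_000,)])
print([len(str(r)) for r in results])  # digit counts of the three results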

Statistics

Using normal Python threads, we can't gain any performance improvement for CPU-bound tasks in CPython due to GIL contention. Hence, comparing approaches to parallelism using a simple factorial function, we get the following statistics:

Method                                       Total time taken
Multi-processing                             15.07s
Sub-interpreters                             11.48s
C++ extension code (with GIL relinquished)   0.74s
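
For reference, here is a minimal sketch of how such a comparison could be timed, assuming the hypothetical fact.slow_factorial worker from the Usage section and a 20-core machine (the repo's own benchmark scripts live in the benchmarking directory):

import time
from multiprocessing import Pool

import fact  # hypothetical worker module from the Usage section
from subinterpreter_parallelism import parallel

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

N_TASKS, ARG = 20, 50_000

# Sub-interpreters: one thread + interpreter per task.
specs = [['fact', 'slow_factorial', (ARG,)] for _ in range(N_TASKS)]
timed("Sub-interpreters", lambda: parallel(*specs))

# Multi-processing baseline: one worker process per task.
def run_pool():
    with Pool(N_TASKS) as pool:
        pool.map(fact.slow_factorial, [ARG] * N_TASKS)
timed("Multi-processing", run_pool)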

Out of the total time taken to run the functions in parallel using sub-interpreters, we see the following breakdown of time taken at each step:

Step                            Time taken (ms)
Creating interpreters           17
Imports & pickling/unpickling   35
Function call                   2020
Ending interpreters             2.7

Takeaways

  1. Using sub-interpreter parallelism, I was able to verify that the Python process constantly hits close to full CPU utilization across all cores (1995% CPU utilization on a machine with 20 cores). Note that with regular Python threads, CPU utilization hovers around 100% as expected (i.e. almost full utilization of a single core).
  2. Significantly (>20%) better performance with sub-interpreter parallelism compared to multi-processing.
  3. Due to the inherent slowness associated with an interpreted language, it's still better to implement the CPU-bound part of the functionality in C++ using Python extensions.
  4. With Python 3.13, much of this work would become redundant, as interpreters are expected to become part of the stdlib itself. However, it's still fascinating to see how we can achieve similar results in Python 3.12.
