# Parallel Python -- Multiprocess(ing)

The `multiprocessing` module lets us run Python code in several separate processes rather than in several threads. Because each process runs its own interpreter, this sidesteps the Global Interpreter Lock (GIL), which prevents the threads of a single process from executing Python code in parallel.

An advantage of this approach is that each process has its own independent memory space, so processes cannot conflict by accessing the same memory location at the same time.
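As a minimal sketch of what independent memory means in practice, the snippet below changes a module-level variable inside a child process and shows that the parent's copy is untouched. It uses the standard-library `multiprocessing` module, and the names `counter`, `increment`, and `run_demo` are illustrative only. Inside a notebook, the `multiprocess` package used later in this section may be needed instead.

```python
import multiprocessing as mp

# Module-level state; every process works on its own copy of this dict.
counter = {"value": 0}

def increment(queue):
    """Runs in the child: change the child's copy and report its value."""
    counter["value"] += 1
    queue.put(counter["value"])

def run_demo():
    """Start one child process and compare its view of `counter` with ours."""
    queue = mp.Queue()
    child = mp.Process(target=increment, args=(queue,))
    child.start()
    child_value = queue.get()   # value seen inside the child
    child.join()
    return child_value, counter["value"]

if __name__ == "__main__":
    child_value, parent_value = run_demo()
    print("child saw:", child_value)
    print("parent still has:", parent_value)
```

The child reports its own incremented copy, while the parent's `counter` keeps its original value, which is why results have to be sent back explicitly (here through a `Queue`).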

### Example: A Serial Program

The most basic use of the multiprocessing module is the `Process` class. Before using it, let's write a simple serial program that generates a list of 16 random strings.

In [1]:
import random
import string

random.seed(123)

n = 16 # iterations

# define a string generator function
def rand_string(length):
    """ Generates a random string of numbers, lower- and uppercase chars. """
    rand_str = ''.join(random.choice(
                        string.ascii_lowercase
                        + string.ascii_uppercase
                        + string.digits)
                   for i in range(length))
    return rand_str

# Generate the list of strings serially (no processes are involved yet)
results = [ rand_string(5) for x in range(n) ]

# Print our result
print(results)

['drfXA', 'rg153', 'cyIJv', 'v2dki', 'vJvSp', 'ka5BX', 'f4Mye', 'auUCg', '5cfQj', 'iY6bs', '6BKEq', 'E1cXt', 'vHZEn', '0MOHK', 'Z9uaz', '5XPGB']


In [2]:
# Let's benchmark how long it takes to generate all n = 16 strings

import time

%timeit ([rand_string(5) for x in range(n)])

85.2 µs ± 109 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
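`%timeit` is an IPython magic, so it only works in a notebook or IPython shell. As a rough sketch of the same measurement in a plain script, the standard-library `timeit` module can be used. The helper name `time_serial` and the repeat count are assumptions chosen for illustration, and the absolute timings will differ from machine to machine.

```python
import random
import string
import timeit

ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits

def rand_string(length):
    """Generates a random string of numbers, lower- and uppercase chars."""
    return ''.join(random.choice(ALPHABET) for _ in range(length))

def time_serial(n=16, repeats=1000):
    """Average time in seconds for one serial run that generates n strings."""
    total = timeit.timeit(lambda: [rand_string(5) for _ in range(n)],
                          number=repeats)
    return total / repeats

if __name__ == "__main__":
    print(f"{time_serial() * 1e6:.1f} microseconds per loop")
```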


## Parallelizing our Code

This code segment is **embarrassingly parallel**: generating one string does not depend on any other, so the work is trivial to split up. Each string can be generated by a separate process, and the final list can be assembled at the end.

### Determining the number of compute processors on a system

We can query the number of processors on a system to make the most of its resources. For Python, it is better to count only physical cores, and not the 'logical' cores that result from the simultaneous multithreading (SMT) capabilities of a CPU.

In [3]:
import psutil

# logical =  False (we do not count logical cores from SMT)
psutil.cpu_count(logical=False)

28
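`psutil` is a third-party package, and `psutil.cpu_count(logical=False)` can return `None` when the physical core count cannot be determined. The sketch below is one possible way to guard against both cases by falling back to the standard library's `os.cpu_count()`, which reports logical CPUs. The function name `physical_core_count` is hypothetical.

```python
import os

def physical_core_count():
    """Best-effort count of physical cores, falling back to logical CPUs."""
    try:
        import psutil  # third-party; may not be installed
        cores = psutil.cpu_count(logical=False)  # may be None if undetermined
    except ImportError:
        cores = None
    # os.cpu_count() counts logical CPUs and may also return None.
    return cores or os.cpu_count() or 1

if __name__ == "__main__":
    print("worker processes to start:", physical_core_count())
```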

### Introducing Multiprocess(ing): The `Process` class

The `Process` class lets us spawn processes explicitly. Each `Process` is usually given a predefined function to run (its `target`), along with the arguments to pass to that function.
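Before the full example, here is a minimal sketch of the life cycle of a single `Process`: create it, start it, wait for it, and inspect its exit code. It uses the standard-library `multiprocessing` module, and the names `greet` and `run_one` are illustrative only.

```python
import multiprocessing as mp

def greet(name):
    """Target function executed inside the child process."""
    print(f"hello from {name}")

def run_one(name):
    """Spawn a single process, wait for it, and return its exit code."""
    p = mp.Process(target=greet, args=(name,))
    p.start()            # launch the child; greet(name) runs there
    p.join()             # block until the child exits
    return p.exitcode    # 0 means the target returned without an error

if __name__ == "__main__":
    print("exit code:", run_one("worker-0"))
```

Note that `args` must be a tuple, so a single argument needs a trailing comma, as in `(name,)`.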

### Example: Using `Process` to Parallelize our Program

In [4]:
import multiprocess as mp
import random
import string

random.seed(123)

# Define an output queue
# Queue allows communication between processes using picklable data (.put() and .get() methods)
# Our processes only need to communicate when aggregating their output
output = mp.Queue()

# Define an example function; we pass in the queue so that its output can be collected
def rand_string(length, output):
    """ Generates a random string of numbers, lower- and uppercase chars. """
    rand_str = ''.join(random.choice(
                        string.ascii_lowercase
                        + string.ascii_uppercase
                        + string.digits)
                   for i in range(length))
    output.put(rand_str)

# Set up a list of processes that we want to run
# mp.Process is called to spawn a process
# target = the function we would like to run
# args = tuple of arguments that are accepted by the function

processes = [mp.Process(target=rand_string, args=(5, output)) for x in range(n)]

# Each process needs to be started
for p in processes:
    p.start()

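# Get process results from the output queue
# Reading the queue before joining avoids a possible deadlock: a process that
# has put data on a queue may not exit until that data has been consumed
results = [output.get() for p in processes]

# Wait until each process has finished before moving on!
for p in processes:
    p.join()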

print(results)

['koTia', 'CCm47', 'NEQ70', '7ZLDi', 'OWSFT', 'M24tK', 'pnLxV', 'np7uV', 'peF7z', 'FZBzQ', 'CMpHC', 'X5sP5', 'J4ka6', '1w9bb', 'dk9yH', 'P3CfN']
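These strings differ from the serial run even though both cells call `random.seed(123)`. Seeding in the parent does not reliably control the random state of child processes: depending on the start method and Python version, a child may reseed itself or re-run the module-level seed call. If reproducibility matters, one option (a sketch, not the approach used above) is to give every task its own seed and a private `random.Random` instance. The names `rand_string_seeded`, `worker`, `run`, and `base_seed` are hypothetical.

```python
import multiprocessing as mp
import random
import string

ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits

def rand_string_seeded(length, seed):
    """Generate a string from a private generator seeded with `seed`."""
    rng = random.Random(seed)   # independent of the global random state
    return ''.join(rng.choice(ALPHABET) for _ in range(length))

def worker(length, seed, output):
    """Process target: put a (seed, string) tuple on the queue."""
    output.put((seed, rand_string_seeded(length, seed)))

def run(n=4, base_seed=123):
    """Generate n strings in n processes, one distinct seed per process."""
    output = mp.Queue()
    procs = [mp.Process(target=worker, args=(5, base_seed + i, output))
             for i in range(n)]
    for p in procs:
        p.start()
    results = [output.get() for _ in procs]   # read results before joining
    for p in procs:
        p.join()
    return sorted(results)                    # seeds are unique, so this orders by seed

if __name__ == "__main__":
    print(run())
    print(run())   # the same strings are produced on every call
```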


#### Order of Results

The output queue does not preserve the order in which the processes were created; it only reflects the order in which processes put their data on the queue. This can be problematic if the order of elements in the result matters.

Luckily, we can pass each process a rank (its index) and store that rank alongside its result, which makes it easy to sort the results afterwards.

In [None]:
import multiprocess as mp
import random
import string

random.seed(123)

output = mp.Queue()

# Let's create a rank parameter so we can assign a rank to each result
def rand_string(length, rank, output):
    """ Generates a random string of numbers, lower- and uppercase chars. """
    rand_str = ''.join(random.choice(
                        string.ascii_lowercase
                        + string.ascii_uppercase
                        + string.digits)
                   for i in range(length))
    # Put a single (rank, string) tuple on the queue; Queue.put() takes one item
    output.put((rank, rand_str))

processes = [mp.Process(target=rand_string, args=(5, x, output)) for x in range(n)]

for p in processes:
    p.start()

# Read the results before joining, then wait for each process to finish
results = [output.get() for p in processes]

for p in processes:
    p.join()

# sort the results by the rank of each process
results.sort()
results = [r[1] for r in results]

print(results)
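The sorting step can be looked at in isolation. The sketch below assumes each worker puts a `(rank, value)` tuple on the queue and uses made-up queue contents in place of real process output. The helper name `order_by_rank` is hypothetical.

```python
from operator import itemgetter

def order_by_rank(ranked):
    """Return the values of (rank, value) pairs, ordered by rank."""
    return [value for _, value in sorted(ranked, key=itemgetter(0))]

if __name__ == "__main__":
    # Hypothetical queue contents, in the order the processes happened to finish.
    arrived = [(2, 'cc'), (0, 'aa'), (3, 'dd'), (1, 'bb')]
    print(order_by_rank(arrived))   # ['aa', 'bb', 'cc', 'dd']
```

Sorting with `key=itemgetter(0)` compares only the ranks, so it also works when the values themselves cannot be compared with each other.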