# Week 5 (Unit 4) lab: Multiprocessing in Python

## Introduction

In this exercise, we explore Python's multiprocessing capabilities to support the concepts of processes, the process lifecycle and concurrent execution. Over the last few weeks we've examined the time complexity of several algorithms for sorting and matrix multiplication. To interrogate their behaviour, we ran each algorithm `N` times _sequentially_, calculated the mean running time and standard deviation, and examined how the dependent variable (time) changes as we linearly increase an independent variable (the size of the list or matrix).

Fundamentally, this approach is quite slow, since it uses only a single CPU core. For example, running 100 sequential repeats of a product between two 250x250 square matrices can take quite a long time. But we might have multiple CPU cores available, and we can use them by creating multiple worker processes to get results much faster. We have to be careful, though: if we do this incorrectly, we will add noise to our results, making them much less reliable and/or harder to interpret.

Today's lab provides you with both the tools and the intuition for finding the critical threshold that maximises use of the resources available to you on any given machine (the approach scales beyond a single machine). You are tasked with running a `core_investigation` to find the critical threshold for the machine you are working on, i.e. the point beyond which adding processes adds noise to the running time of an algorithm. You are provided with an arbitrary algorithm, the function `work_that_CPU` in `defs.py`, and we want to minimise the time it takes to run `N` instantiations of the algorithm while keeping the amount of work within each instantiation fixed.

You should empirically demonstrate what the optimal number of processes is and _explain why_. Think about what is going on within the machine. Why would results be unreliable when we've breached the critical threshold?

Once you have completed this, you are then tasked with splitting up the resources you do have available in a way that will provide you with the best assessment of the performance of the algorithm in question.

## Exercises

First let's import what we need: `Pool` from the multiprocessing library, *time* for timing code, and *os* for reading and deleting files.

In [1]:
from multiprocessing import Pool
import time
import os

The multiprocessing package supports the spawning of separate processes running your Python code. In the following code, we create the given number of processes. Note that these are operating system level processes with their own threads of control and address spaces.

Let's investigate how processes are mapped to cores. For this, we're going to create a worker function that executes a lot of instructions (CPU-Bound), and we're going to time how long it takes. 

This worker function must be placed in an auxiliary file, here *defs.py*. Take a look at this file, and then import it by running the cell below.

In [2]:
import defs

In defs, there is a function called *work_that_CPU*, which will generate some random numbers (to the amount specified by the input *num_loops*), and save the time taken to do this, as well as the name of the process that is executing it.  
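Since *defs.py* is not reproduced here, the following is only a minimal sketch of what such a worker might look like. The name `work_that_cpu_sketch` is hypothetical, and unlike the function described above it returns its measurements instead of saving them to disk; the real *work_that_CPU* may differ.

```python
import random
import time
from multiprocessing import current_process


def work_that_cpu_sketch(num_loops):
    """Hypothetical stand-in for defs.work_that_CPU (an assumption, not the real code)."""
    start = time.perf_counter()
    for _ in range(num_loops):
        random.random()  # CPU-bound busy work: generate one random number per loop
    elapsed = time.perf_counter() - start
    # Report which process did the work and how long it took.
    return current_process().name, elapsed


name, elapsed = work_that_cpu_sketch(100_000)
print(f"{name} took {elapsed} seconds")
```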

Next we write a function that will run *work_that_CPU* a number of times (as specified by the input *num_processes*), with each run being given to a different process.

In [3]:
def core_investigation(num_processes, num_loops):
    # One input per run: every run of work_that_CPU gets the same num_loops.
    inputs = [num_loops] * num_processes
    pool_of_processes = Pool(processes=num_processes)  # Create processes to the amount specified by num_processes
    pool_of_processes.map(defs.work_that_CPU, inputs)  # Distribute the runs of work_that_CPU across the pool, with num_loops as the input

    pool_of_processes.close()  # No more tasks will be submitted; the workers exit once they are idle
    pool_of_processes.join()   # Wait for the worker processes to exit

The operating system schedules each of our processes onto a CPU core. Starting from 1 process, the time taken for each process should remain roughly constant until the number of processes exceeds the number of available cores (which could range from 4 to 16 depending on your machine).
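To get a first estimate of how many cores are available, you can ask Python directly. This is a minimal sketch: `available_cores` is a hypothetical helper name, and `os.cpu_count()` reports logical cores, which may exceed the number of physical cores on machines with simultaneous multithreading.

```python
import os


def available_cores():
    """Return the number of logical CPU cores Python reports (at least 1)."""
    count = os.cpu_count()  # May return None if the count cannot be determined
    return count if count is not None else 1


print(f"Logical cores reported: {available_cores()}")
```

Treat this number as a starting point for the investigation, not as the answer: the threshold you measure may differ from it.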

To see this, play around with the number of processes (*num_processes*) and the amount of computation in each process (*num_loops*) below. Settle on a value for *num_loops* such that a single process takes around 1s to complete.

In [16]:
num_loops=10000000
num_processes=2
core_investigation(num_processes,num_loops)

# Get the run-times for each of the processes, print them out, and then delete the files that store these.
for filename in os.listdir('.'):
    if filename.startswith('TimedProcess'):
        with open(filename, 'r') as file:
            content = file.read()
            print(content)
        os.remove(filename)

0.33145713806152344
0.3354330062866211


If each process is mapped to a different core, then the time taken should remain roughly constant. If not, then execution time should increase. 

Now run this for all values of *num_processes* from 1 up to 20, and compute for each run the total amount of time taken to run all the processes. Plot a graph of number of processes executing *work_that_CPU* against total time taken.

Note: you no longer need to save the time taken by every process to disk, so comment out the lines that do this in *work_that_CPU* in *defs.py*. To keep track of progress, just print every value of *num_processes* as you get to it.
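One way to organise this sweep is sketched below. It is only an illustration: `sweep` is a hypothetical helper, and `run` stands for whatever you want to time, for example `lambda n: core_investigation(n, num_loops)` in this notebook.

```python
import time


def sweep(run, max_processes):
    """Time run(n) for n = 1..max_processes; return a list of (n, seconds) pairs."""
    results = []
    for n in range(1, max_processes + 1):
        print(f"num_processes = {n}")  # Progress report
        start = time.perf_counter()
        run(n)
        results.append((n, time.perf_counter() - start))
    return results


# Dummy workload so the sketch runs on its own; swap in the real call in the notebook.
timings = sweep(lambda n: sum(range(10_000 * n)), 3)
```

The resulting pairs can be passed to a plotting library, with the first element on the x-axis and the second on the y-axis.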

In [26]:
num_loops=10000000
num_processes=20

for i in range(1, num_processes + 1):  # 1 up to and including num_processes
    core_investigation(i,num_loops)
    print(" ")

SpawnPoolWorker-489 took 0.3211820125579834 seconds
 
SpawnPoolWorker-491 took 0.3270840644836426 seconds
SpawnPoolWorker-490 took 0.32973313331604004 seconds
 
SpawnPoolWorker-493 took 0.33600902557373047 seconds
SpawnPoolWorker-492 took 0.33603978157043457 seconds
SpawnPoolWorker-494 took 0.3364839553833008 seconds
 
SpawnPoolWorker-495 took 0.33811092376708984 seconds
SpawnPoolWorker-496 took 0.3376791477203369 seconds
SpawnPoolWorker-497 took 0.33625316619873047 seconds
SpawnPoolWorker-498 took 0.3363349437713623 seconds
 
SpawnPoolWorker-501 took 0.3210580348968506 seconds
SpawnPoolWorker-500 took 0.3357236385345459 seconds
SpawnPoolWorker-499 took 0.3404719829559326 seconds
SpawnPoolWorker-502 took 0.33951306343078613 seconds
SpawnPoolWorker-503 took 0.33696603775024414 seconds
 
SpawnPoolWorker-504 took 0.3328440189361572 seconds
SpawnPoolWorker-506 took 0.33389711380004883 seconds
SpawnPoolWorker-505 took 0.33776021003723145 seconds
SpawnPoolWorker-507 took 0.3352088928222656 s

You should notice that the total run-time is roughly the same as the time it takes each individual process to run, even when there are more processes than cores available. Why do you think this is the case?

> _Answer_: The scheduler in your computer makes sure that all processes that you've created take turns at running, so they keep switching in and out of the CPU. The first process doesn't just hog the CPU until it is finished!

Now run *work_that_CPU* 20 times without using multiple processes, and time how long it takes. It should take a lot longer.

In [30]:
num_loops=10000000
num_processes=20

for i in range(1,num_processes):
    core_investigation(i,num_loops)
    print(" ")

0.3219902515411377 seconds
 
0.3266019821166992 seconds
0.3292059898376465 seconds
 
0.3365669250488281 seconds
0.33625197410583496 seconds
0.3388187885284424 seconds
 
0.3350858688354492 seconds
0.3385002613067627 seconds
0.338947057723999 seconds
0.3379251956939697 seconds
 
0.3255038261413574 seconds
0.3368990421295166 seconds
0.33895301818847656 seconds
0.33620619773864746 seconds
0.34009790420532227 seconds
 
0.3395829200744629 seconds
0.3427257537841797 seconds
0.3364901542663574 seconds
0.3404538631439209 seconds
0.3380460739135742 seconds
0.34150075912475586 seconds
 
0.3416140079498291 seconds
0.3402440547943115 seconds
0.340224027633667 seconds
0.338695764541626 seconds
0.3370249271392822 seconds
0.3373110294342041 seconds
0.3364701271057129 seconds
 
0.33725905418395996 seconds
0.3366227149963379 seconds
0.34212398529052734 seconds
0.3396718502044678 seconds
0.34276318550109863 seconds
0.33931493759155273 seconds
0.3418159484863281 seconds
0.3395872116088867 seconds
 
0.3816
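For comparison, a sequential baseline (all runs in the current process, one after another) can be sketched as follows. `run_sequentially` and `busy_work` are hypothetical names; in the notebook you would pass `defs.work_that_CPU` as the worker.

```python
import random
import time


def busy_work(num_loops):
    """Hypothetical CPU-bound stand-in for defs.work_that_CPU."""
    for _ in range(num_loops):
        random.random()


def run_sequentially(worker, num_repeats, num_loops):
    """Call worker(num_loops) num_repeats times in this process; return total seconds."""
    start = time.perf_counter()
    for _ in range(num_repeats):
        worker(num_loops)
    return time.perf_counter() - start


total = run_sequentially(busy_work, 20, 10_000)
print(f"20 sequential runs took {total} seconds")
```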

Now you're going to keep the number of runs of *work_that_CPU* fixed at 20 and vary the number of concurrent processes that do this job, from 1 up to 20. You'll then compute the run-times and plot the number of concurrently running processes against the time taken.

To do this, the first step is to write a modified version of the *core_investigation* function that now contains separate inputs for the number of times *work_that_CPU* will be run (*num_repeats*) and the number of processes that will run concurrently (*num_processes*). 

> Hint: You will do a certain number of rounds with *work_that_CPU* given to *num_processes* processes, until the number of repeats left to do is less than *num_processes*. Then you do one more round with the number of processes being the number of repeats you still need to run.
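The arithmetic behind the hint can be sketched as a small pure function. `split_into_rounds` is a hypothetical helper, not part of the lab code; it only computes how many processes each round would use.

```python
def split_into_rounds(num_repeats, num_processes):
    """Return the number of processes to use in each round."""
    full_rounds, remainder = divmod(num_repeats, num_processes)
    rounds = [num_processes] * full_rounds
    if remainder:
        rounds.append(remainder)  # Final, smaller round for the leftover repeats
    return rounds


print(split_into_rounds(20, 6))  # [6, 6, 6, 2]
```

With 20 repeats and 6 processes, for instance, there are three full rounds of 6 followed by one round of 2.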

In [None]:
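def core_investigation2(num_repeats, num_processes, num_loops=10000000):
    # Run work_that_CPU num_repeats times, with at most num_processes running concurrently.
    repeats_left = num_repeats
    while repeats_left > 0:
        # Full rounds use num_processes processes; the final round uses whatever is left.
        round_size = min(num_processes, repeats_left)
        inputs = [num_loops] * round_size
        pool_of_processes = Pool(processes=round_size)  # Create one process per run in this round
        pool_of_processes.map(defs.work_that_CPU, inputs)  # Give each run num_loops as its input
        pool_of_processes.close()  # No more tasks will be submitted to this pool
        pool_of_processes.join()   # Wait for this round's workers to exit
        repeats_left -= round_size

num_repeats = 20
num_processes = 20

core_investigation2(num_repeats, num_processes)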

Now compute the run-times and make the plot: keep the number of runs of *work_that_CPU* fixed at 20, vary the number of concurrent processes from 1 up to 20, and plot the number of concurrently running processes against the time taken.

> Hint: Do not use lots of repeats in the *timeit* function. You don't want this to take too long. Also consider printing progress reports along the way so you know how long you're going to have to wait for output.
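As a rough illustration of the hint, `timeit.repeat` with `number=1` and a small `repeat` keeps each measurement short. The helper name `time_few` and the dummy workload are assumptions made so that the sketch runs on its own.

```python
import timeit


def time_few(fn, repeat=3):
    """Run fn() once per measurement, `repeat` times; return the list of timings in seconds."""
    return timeit.repeat(fn, repeat=repeat, number=1)


timings = time_few(lambda: sum(range(100_000)))
print(f"mean of {len(timings)} runs: {sum(timings) / len(timings)} seconds")
```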

**Optional extension.** Repeat the above exercise running some other functions multiple times. Use a function that actually has some output (which will need to be sent back to the parent process, i.e. the rest of the Jupyter notebook). Do you always get speed up by parallelising? Play around with functions that take different lengths of time to run, and which return different quantities of data.
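When thinking about the cost of returning data, it may help to recall that multiprocessing sends results back to the parent process by pickling them. The sketch below, using the hypothetical helper `payload_bytes`, estimates how many bytes a given return value occupies once pickled; it does not measure the transfer time itself.

```python
import pickle


def payload_bytes(result):
    """Size in bytes of `result` once pickled, as a rough proxy for transfer cost."""
    return len(pickle.dumps(result))


for n in (10, 1_000, 100_000):
    print(n, payload_bytes(list(range(n))))
```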