Parallel Processing

Parallel processing on the CPU

General Principles

Performance can be increased significantly using the Parallel Computing Toolbox provided by MATLAB. By default, with the exception of some built-in functions, MATLAB runs mostly on a single thread. Modern CPUs have multiple cores and threads, so under the default behavior CPU utilization will be low when running the code. In my case, I only got ~20% utilization with an Intel 8700K @ 4.8GHz.

The parallel toolbox is installed as an add-on and can then be used by turning a for loop into a parfor loop. There are limitations to parfor loops; e.g. they can't be nested, and there is a performance penalty when starting one (so ideally we want the outermost loop to be the parfor one). The parfor loop works by assigning different iterations to the worker CPUs, so it's extremely important to ensure our iterations are independent; otherwise a worker could request a variable only available to another worker.
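As a minimal sketch of the conversion (the loop body below is a placeholder, not code from this repository):

```matlab
% Hypothetical example: the loop body is just a stand-in for an expensive computation.
% Each iteration must be independent of the others for parfor to be valid.
results = zeros(1, 50);
parfor n = 1:50                        % previously: for n = 1:50
    results(n) = sum(primes(n * 100)); % sliced output: each worker writes only its own n
end
```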

There are restrictions on how indexed variables in a parfor loop can be used. See write_Look_Up_Parallel for an example of how to circumvent them.
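The usual workaround (illustrated below with placeholder values; this is not the actual write_Look_Up_Parallel code) is to build each iteration's result in a local temporary variable, where arbitrary indexing is allowed, and then perform a single sliced assignment per iteration:

```matlab
% Hypothetical illustration of working around parfor's sliced-variable restrictions.
N = 4; M = 3;                        % small sizes just for illustration
output = cell(1, N);
parfor i = 1:N
    temp = zeros(1, M);              % local temporary, owned by this iteration
    for j = 1:M
        temp(j) = i + j^2;           % arbitrary indexing is fine on the local variable
    end
    output{i} = temp;                % one sliced assignment per iteration keeps parfor happy
end
```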

Pool preferences

I recommend that you change the preferences for the default cluster to match your CPU. In my case, the 8700K has 6 cores and 12 threads, so I had the option of either using 6 workers with 2 threads each (with MATLAB handling the multithreading) or 12 workers with 1 thread each (with the processor handling the multithreading). It turns out the latter option is significantly faster, leading to constant 100% CPU utilization; the former only reached about 80% CPU load.
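This can be done through the Parallel Preferences dialog, or programmatically along the following lines (adjust the worker count to your own CPU; 12 matches the 8700K's thread count mentioned above):

```matlab
% One possible way to set the default local cluster to 12 single-threaded workers.
c = parcluster('local');
c.NumWorkers = 12;        % 12 workers with 1 thread each, rather than 6 with 2
saveProfile(c);           % persist the change so future pools use it
parpool(c, 12);           % start the pool explicitly before running the parfor loops
```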

Assigning equal load

When computing the homology H_k(S^{nσ+mλ}) we could have an outer parfor loop iterating over n, with nested loops inside it iterating over m and k. The problem with this implementation is load imbalance: the worker doing the n=1 iteration would finish much sooner than the one doing the n=50 iteration, and would then sit mostly idle. Thus it's important to assign the load equally to all workers. This can be done using the parfor options together with our function that distributes the inputs to workers, distribute_inputs_to_workers. This function works by assigning a weight to each (n,m), ordering all (n,m)'s in our range by weight, and then forming 12 subsets of (approximately) equal total weight; each worker handles just one of these subsets. The weight function that heuristically seems to be fastest is n+m^3, but I haven't done exhaustive testing here.
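A rough sketch of the weighted-partition idea is given below (the actual distribute_inputs_to_workers in this repository may differ in its details; the function name and greedy strategy here are just illustrative):

```matlab
% Sketch: split a list of (n,m) pairs into num_workers subsets of roughly equal
% total weight, using the heuristic weight n + m^3 mentioned above.
function subsets = distribute_by_weight(pairs, num_workers)
    % pairs: k-by-2 matrix of (n,m) values; num_workers: e.g. 12
    weights = pairs(:,1) + pairs(:,2).^3;    % heuristic cost of each (n,m)
    [~, order] = sort(weights, 'descend');   % place the heaviest jobs first
    subsets = cell(1, num_workers);
    totals = zeros(1, num_workers);
    for idx = order'                         % greedy: always fill the lightest bin
        [~, w] = min(totals);
        subsets{w} = [subsets{w}; pairs(idx,:)];
        totals(w) = totals(w) + weights(idx);
    end
end
```

Each worker w then processes the pairs in subsets{w}, so the outer parfor loop runs over the workers rather than over n directly.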

Parallel processing on the GPU

In principle, GPUs are much faster than CPUs at matrix operations and highly parallel workloads. MATLAB can natively use GPU arrays, but unless the code is specifically optimized for GPUs, my experiments show low load on the GPU (~10%) and correspondingly low performance (slower than a single CPU thread) when running the test functions. This is with an NVIDIA consumer GPU. If you have any ideas on how to improve GPU performance, or whether it even makes sense to use it, I would be extremely interested in hearing them.
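For reference, the basic GPU array usage looks like this (generic MATLAB, not an optimization specific to this repository; it requires the Parallel Computing Toolbox and a supported NVIDIA GPU):

```matlab
% Minimal gpuArray example: move data to the GPU, compute there, gather the result.
A = gpuArray(rand(2000));    % transfer a 2000x2000 matrix to GPU memory
B = A * A;                   % the matrix multiplication runs on the GPU
B = gather(B);               % bring the result back to host memory
```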