# Code Parallelization

QSPRpred can also run operations on data in parallel. It tries to take the headache out of parallelization by
providing a simple interface for running operations on chunks of data in parallel. In this tutorial, we will show how to use
these features.

## Data Set

We will borrow the multitask data set from the [associated tutorial](../modelling/multi_task_modelling.ipynb) since it contains a larger number of molecules:

In [1]:
import pandas as pd

from qsprpred.data import MoleculeTable

# load the data
df = pd.read_csv('../../tutorial_data/AR_LIGANDS.tsv', sep='\t')
df = df.pivot(index="SMILES", columns="accession", values="pchembl_value_Mean")
df.columns.name = None
df.reset_index(inplace=True)
mt = MoleculeTable(name="ParallelizationExample", df=df)
len(mt)

6797

## Setting `nJobs` and `chunkSize`

QSPRpred supports parallelization of code by chunking the data set into smaller
pieces and running the code on each chunk in parallel. This is done by setting
the `nJobs` and `chunkSize` properties of the `MoleculeTable` object. The
`nJobs` property specifies the number of parallel jobs to run. The `chunkSize`
property specifies the number of molecules to process in each job. 

The `chunkSize` property is automatically calculated based on the number of jobs, but
in some cases it may be useful to set it manually. For example, if the code
being run in parallel is very fast, it may be useful to increase the chunk size
to reduce the overhead of parallelization. On the other hand, if the code being
run in parallel is very slow, it may be useful to decrease the chunk size to
reduce the amount of time spent waiting for the slowest job to finish. 

In addition, the
`chunkSize` property also affects the memory usage of the parallelization. If
the code being run in parallel is very memory intensive, it may be useful to
decrease the chunk size to reduce the memory usage of the parallel processes 
by running on smaller batches of data.

We will now illustrate a few different scenarios. First, we will run a simple
descriptor calculation in parallel:

In [2]:
from qsprpred.data.descriptors.sets import DescriptorSet
from qsprpred.data.descriptors.fingerprints import MorganFP
from qsprpred.utils.stopwatch import StopWatch


def time_desc_calc(data: MoleculeTable, desc_set: DescriptorSet):
    """A simple function to time descriptor calculation on a data table.
    
    Args:
        data: The data table to calculate descriptors on.
        desc_set: The descriptor set to calculate.
    """
    if data.hasDescriptors([desc_set])[0]:
        print(f"Removing old descriptors: {desc_set}")
        data.dropDescriptors([desc_set])
    print(f"Running and timing descriptor calculation: {desc_set}")
    watch = StopWatch()
    data.addDescriptors([desc_set])
    watch.stop()


time_desc_calc(mt, MorganFP(3, 2048))

Running and timing descriptor calculation: MorganFP
Time it took: 4.2293026379993535


This calculation is done on one CPU by default:

In [3]:
mt.nJobs

1

and the whole data set is supplied as one chunk:

In [4]:
mt.chunkSize

6797

We can now try running this calculation in parallel on 4 CPUs:

In [5]:
mt.nJobs = 4

The chunk size will automatically be adjusted to 25% of the data set size so that each portion of the data set is processed on a separate CPU:

In [6]:
mt.chunkSize

1699

We can see how this affects the time taken to run the calculation:

In [7]:
time_desc_calc(mt, MorganFP(3, 2048))

Removing old descriptors: MorganFP
Running and timing descriptor calculation: MorganFP
Time it took: 1.6930840409986558


This was faster, but not by a factor of 4. Parallelization carries some overhead, and fingerprint calculation is very fast by itself, so the overhead accounts for a larger share of the runtime. In such cases, be careful about setting the chunk size manually:

In [8]:
mt.chunkSize = 50
time_desc_calc(mt, MorganFP(3, 2048))

Removing old descriptors: MorganFP
Running and timing descriptor calculation: MorganFP
Time it took: 16.98889161799889


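This was much slower than even the single-CPU calculation! With a chunk size of 50, the data set is split into many small chunks, and the overhead paid for each chunk dominates the runtime.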
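The chunk counts behind the three timings above can be worked out with a few lines of plain Python. This is only a sketch: the helper names `default_chunk_size` and `count_chunks` are hypothetical, and the integer-division rule is an assumption that happens to reproduce the `chunkSize` values shown in this tutorial, not a statement about how QSPRpred computes them internally.

```python
import math


def default_chunk_size(n_items: int, n_jobs: int) -> int:
    """Assumed default: integer division of the data set size by the job count."""
    return max(1, n_items // n_jobs)


def count_chunks(n_items: int, chunk_size: int) -> int:
    """Number of chunks needed to cover all items, including a smaller last chunk."""
    return math.ceil(n_items / chunk_size)


n_molecules = 6797  # size of the data set used in this tutorial

for n_jobs in (1, 2, 4):
    size = default_chunk_size(n_molecules, n_jobs)
    print(f"nJobs={n_jobs}: chunkSize={size}, chunks={count_chunks(n_molecules, size)}")

# a manually chosen small chunk size produces many more chunks to dispatch
print(f"chunkSize=50: chunks={count_chunks(n_molecules, 50)}")
```

With a chunk size of 50, the 6797 molecules are split into 136 chunks instead of a handful, so any fixed cost paid per chunk is paid many more times.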

## Custom Operations

Descriptor calculators are already prepared actions that you can use with the `addDescriptors` method. However, you can also run custom operations on the data set in parallel. To do this, you need to use the `apply` method. This method takes a function as input and runs it on each chunk of the data set in parallel. The function must take a dictionary of properties as input and return anything as output:

In [9]:
def processing_function(props: dict, *args, **kwargs):
    """A simple function to process a chunk of a data table. Just prints its arguments."""
    print(args)
    print(kwargs)
    for prop in props:
        print(prop, props[prop][0])


mt.nJobs = 2  # also resets the chunk size to 50% of the data set size again
mt.apply(processing_function, func_args=("A",), func_kwargs={"B": None})

<generator object PandasDataTable.apply at 0x7fabf9ccd540>

As you can see, this gives us a generator object. In order to run the function on each chunk and get the results, we need to iterate over the generator and collect results:

In [10]:
results = []
for result in mt.apply(processing_function, func_args=("A",), func_kwargs={"B": None}):
    results.append(result)

('A',)('A',)
{'B': None}

{'B': None}SMILES
 SMILESBrc1cc(Nc2nc3c(ncnc3N3CCCC3)s2)ccc1 
COc1cc(-n2c(=O)n(-c3c(OC)cccc3)c3c2nc(NC2CC2)nc3)ccc1P0DMS8
P0DMS8  5.89nan

P29274 P292746.61 
5.29P29275
 P29275nan 
nanP30542
 P30542nan 
QSPRID5.9 
ParallelizationExample_0000QSPRID
 
ParallelizationExample_3398('A',)
{'B': None}
SMILES c1nc2c(nc(Nc3ccc(N4CCOCC4)cc3)nc2NC2CCCCCCC2)[nH]1
P0DMS8 5.56
P29274 nan
P29275 nan
P30542 nan
QSPRID ParallelizationExample_6796


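The results in this case are just three `None` values, one per chunk, since our function doesn't return anything (6797 molecules with a chunk size of 3398 give two full chunks and one chunk with a single molecule):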

In [11]:
results

[None, None, None]
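The lazy behaviour seen with `apply` is ordinary Python generator behaviour. The sketch below uses a hypothetical stand-in, `apply_lazily`, which is not part of QSPRpred and runs sequentially; it only illustrates why nothing happens until the generator is iterated and why a function without a return value yields one `None` per chunk.

```python
def apply_lazily(func, chunks):
    """Yield func(chunk) for each chunk; nothing runs until iteration starts."""
    for chunk in chunks:
        yield func(chunk)


calls = []


def record(chunk):
    """Remember which chunk was processed and return nothing."""
    calls.append(chunk)


gen = apply_lazily(record, [[1, 2], [3, 4], [5]])
calls_before = list(calls)  # still empty: the generator has not been consumed
results = list(gen)  # iterating triggers the work
print(calls_before, calls, results)
```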

We can also instruct the `apply` method to pass a `DataFrame` instead of a dictionary of properties to the function. This is useful if you want to use the `pandas.DataFrame` API to process the data:

In [12]:
def processing_function_df(props: pd.DataFrame):
    """A simple function that gives us the shape of the chunk."""
    return props.shape


results = []
for result in mt.apply(processing_function_df, as_df=True):
    results.append(result)
results

[(3398, 6), (3398, 6), (1, 6)]

**WARNING:** The `apply` method does not guarantee that the results will be returned in the same order as the chunks appear in the data set. The chunks are processed in parallel, so the order of the results depends on the order in which the parallel processes finish.
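The general pattern for coping with this is to return an identifier together with each result and to reorder afterwards. The following sketch uses the standard library's `concurrent.futures` with threads as an analogy. It does not reflect how QSPRpred schedules its jobs, and `process_chunk` and the toy `chunks` dictionary are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_chunk(chunk_id: int, values: list[int]) -> tuple[int, int]:
    """Return the chunk identifier together with the result."""
    time.sleep(0.01 * (3 - chunk_id))  # earlier chunks deliberately take longer
    return chunk_id, sum(values)


chunks = {0: [1, 2, 3], 1: [4, 5], 2: [6]}

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(process_chunk, cid, vals) for cid, vals in chunks.items()]
    # as_completed yields futures in completion order, not submission order
    completed = [future.result() for future in as_completed(futures)]

# restore the original order using the identifier carried by each result
ordered = [total for _, total in sorted(completed)]
print(ordered)
```

Because every result carries its own identifier, the final order no longer depends on which worker finished first.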

### Molecule Processors

One step above the simple `apply` method is the `processMols` method. This method takes a `MolProcessor` object as input. This object must implement a `__call__` method that takes a list of molecules and a dictionary of properties as input and returns anything as output:

In [13]:
from qsprpred.data.processing.mol_processor import MolProcessor
from rdkit.Chem import Mol
from typing import Any


class MyProcessor(MolProcessor):
    def __call__(self, mols: list[str | Mol], props: dict[str, list[Any]], *args,
                 **kwargs) -> Any:
        """Just return a tuple of some data extracted for the first molecule in the chunk."""
        return mols[0], type(mols[0]), *props.keys()

    @property
    def supportsParallel(self) -> bool:
        """Needs to be set to indicate if parallelization is supported."""
        return True


results = []
for result in mt.processMols(MyProcessor()):
    results.append(result)
results

[('Brc1cc(Nc2nc3c(ncnc3N3CCCC3)s2)ccc1',
  str,
  'SMILES',
  'P0DMS8',
  'P29274',
  'P29275',
  'P30542',
  'QSPRID'),
 ('COc1cc(-n2c(=O)n(-c3c(OC)cccc3)c3c2nc(NC2CC2)nc3)ccc1',
  str,
  'SMILES',
  'P0DMS8',
  'P29274',
  'P29275',
  'P30542',
  'QSPRID'),
 ('c1nc2c(nc(Nc3ccc(N4CCOCC4)cc3)nc2NC2CCCCCCC2)[nH]1',
  str,
  'SMILES',
  'P0DMS8',
  'P29274',
  'P29275',
  'P30542',
  'QSPRID')]

With `processMols`, we can also automatically convert the molecules to RDKit molecules before passing them to the processor:

In [14]:
results = []
for result in mt.processMols(MyProcessor(), as_rdkit=True):
    results.append(result)
results

[(<rdkit.Chem.rdchem.Mol at 0x7fabf9dbecf0>,
  rdkit.Chem.rdchem.Mol,
  'SMILES',
  'P0DMS8',
  'P29274',
  'P29275',
  'P30542',
  'QSPRID'),
 (<rdkit.Chem.rdchem.Mol at 0x7fabf9e05e90>,
  rdkit.Chem.rdchem.Mol,
  'SMILES',
  'P0DMS8',
  'P29274',
  'P29275',
  'P30542',
  'QSPRID'),
 (<rdkit.Chem.rdchem.Mol at 0x7fabf9dbfa10>,
  rdkit.Chem.rdchem.Mol,
  'SMILES',
  'P0DMS8',
  'P29274',
  'P29275',
  'P30542',
  'QSPRID')]

You can also derive from `MolProcessorWithID` if you want to access the molecule IDs provided by the data set in your processor. This is useful to overcome the issue that the order in which chunks are processed is not guaranteed:

In [15]:
from rdkit.Chem import MolToInchiKey
from qsprpred.data.processing.mol_processor import MolProcessorWithID


class MyProcessorWithID(MolProcessorWithID):
    def __call__(self, mols: list[str | Mol], props: dict[str, list[Any]], *args,
                 **kwargs) -> Any:
        """Calculate Inchi Keys for the molecules in the chunk and return them as a DataFrame using `idProp` as index."""
        return pd.DataFrame({"InchiKey": [MolToInchiKey(x) for x in mols]},
                            index=props[self.idProp])

    @property
    def supportsParallel(self) -> bool:
        return True


# run the calculations
results = []
for result in mt.processMols(MyProcessorWithID(), as_rdkit=True):
    results.append(result)

# concatenate the results into a single DataFrame
df_iks = pd.concat(results)

# sort the DataFrame by the index to ensure same order as in the original molecule table
df_iks.sort_index(inplace=True)

# set the Inchi Keys as a property of the molecule table
mt.addProperty("InchiKey", df_iks.InchiKey.tolist())
mt.getProperty("InchiKey")

QSPRID
ParallelizationExample_0000    YQTYPSIBGJUFHX-UHFFFAOYSA-N
ParallelizationExample_0001    PLOWTFYCKMBDSF-UHFFFAOYSA-N
ParallelizationExample_0002    VPFDYFVHIDPXMF-UHFFFAOYSA-N
ParallelizationExample_0003    JRZQBZNLNNVCDD-UHFFFAOYSA-N
ParallelizationExample_0004    ZQOOZBCGGHKMAZ-UHFFFAOYSA-N
                                          ...             
ParallelizationExample_6792    ATQMYSVYZWCLGV-UHFFFAOYSA-N
ParallelizationExample_6793    BCUWHWNNRNCIEH-UHFFFAOYSA-N
ParallelizationExample_6794    ZFLJHSQHILSNCM-UHFFFAOYSA-N
ParallelizationExample_6795    IWDCLHPAOHUVIN-UHFFFAOYSA-N
ParallelizationExample_6796    SXZJJBXZKSACII-UHFFFAOYSA-N
Name: InchiKey, Length: 6797, dtype: object
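Sorting by the index restores the original order here because the identifiers shown above are zero-padded, so their alphabetical order matches their numeric order. The sketch below checks this with plain Python strings. It assumes that all identifiers keep the four-digit format seen in the output, and the `mol_` names are made up for contrast.

```python
import random

# identifiers in the zero-padded style seen in the output above
ids = [f"ParallelizationExample_{i:04d}" for i in range(6797)]

shuffled = ids.copy()
random.Random(0).shuffle(shuffled)  # simulate results arriving out of order

# zero padding makes lexicographic order equal to numeric order
restored = sorted(shuffled)

# without padding, string sorting would place "mol_10" before "mol_2"
unpadded = sorted(f"mol_{i}" for i in range(12))
print(restored == ids, unpadded[:4])
```

If the identifiers were not padded to a common width, reindexing the results by the original list of identifiers would be a safer choice than sorting.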