# Benchmarking QSPR Models and More

The `qsprpred.benchmarks` module provides tools to benchmark QSPR models, preparation steps, hyperparameter optimization strategies, molecular descriptors and more, either on data sets that are already prepared or on data sets of your own choosing. In this tutorial, we walk through a simple example that benchmarks a selection of models and descriptors on a single classification task.

The first step is to decide which data set(s) to use for your benchmark. To make things a little faster, we will use a small testing data source to fetch our testing data set:

In [1]:
import os

import pandas as pd  # needed by `getData` below to read the TSV file
from qsprpred.data import MoleculeTable
from qsprpred.data.sources import DataSource

# TODO: it would be nice to also have examples for all the data sets we integrate and how to integrate a new data set here

BASE_DIR = "../../tutorial_output/benchmarking"  # directory to store all benchmarking results and files
os.makedirs(BASE_DIR, exist_ok=True)
SEED = 42  # random seed for all random operations


class DataSourceTesting(DataSource):
    """
    Just a simple wrapper around our tutorial data set.
    """

    def __init__(self, name: str, store_dir: str):
        self.name = name
        self.storeDir = store_dir

    def getData(self, name: str | None = None, **kwargs) -> MoleculeTable:
        """We just need to fetch a simple `MoleculeTable`.
        Defining target properties is not necessary. We just need to
        make sure that the data set contains the target properties we
        want to use for benchmarking later. 
        
        To make things faster we will sample only 100 molecules each time.
        This code could also be simplified so that reloading of the file is not necessary.
        """
        name = name or self.name
        # reload the full table and draw a reproducible sample of 100 molecules
        df = pd.read_table("../../tutorial_data/A2A_LIGANDS.tsv")
        return MoleculeTable(
            df=df.sample(100, random_state=SEED),
            name=name,
            store_dir=self.storeDir,
            **kwargs
        )


source = DataSourceTesting("TutorialBenchmarkData", f"{BASE_DIR}/data")
example_ds = source.getData()
example_ds.getDF()

QSPRID (index),SMILES,pchembl_value_Mean,Year,QSPRID
TutorialBenchmarkData_000,CCCn1c(-c2ccccc2)nc2c1ncnc2NC1CCOC1,5.770,2018.0,TutorialBenchmarkData_000
TutorialBenchmarkData_001,CCCn1c(=O)c2c([nH]c(-c3c[nH]nc3)n2)n(CCC)c1=O,6.640,2006.0,TutorialBenchmarkData_001
TutorialBenchmarkData_002,COc1cccc2c1nc(N)n1nc(CN3CCN(c4ncc(F)cc4)CC3C)nc21,7.880,2015.0,TutorialBenchmarkData_002
TutorialBenchmarkData_003,COc1cccc(CCCC(=O)Nc2nc3c(cccc3)c(=O)s2)c1,6.940,2013.0,TutorialBenchmarkData_003
TutorialBenchmarkData_004,COc1c2nc(NC(=O)c3ccc(F)cc3)sc2c(N(CCO)C(C)=O)cc1,7.010,2010.0,TutorialBenchmarkData_004
...,...,...,...,...
TutorialBenchmarkData_095,Cc1ccc2c(O)c(C(=O)NCc3ccco3)cnc2n1,7.460,2005.0,TutorialBenchmarkData_095
TutorialBenchmarkData_096,CC(O)CCc1nc(N)c2nc(-n3nccn3)n(C)c2n1,7.102,2013.0,TutorialBenchmarkData_096
TutorialBenchmarkData_097,CCN(C(C)=O)c1ccc(Cl)c2c1sc(NC(=O)c1ccc(F)cc1)n2,7.820,2010.0,TutorialBenchmarkData_097
TutorialBenchmarkData_098,CS(=O)(=O)c1ccc(Cn2ncc3c2nc(N)nc3-c2ccco2)cc1,6.470,2008.0,TutorialBenchmarkData_098


In [2]:
# this is where all data sets derived from our source will be stored
example_ds.storeDir

'../../tutorial_output/benchmarking/data/TutorialBenchmarkData'

You can wrap your own data set in a `DataSource` class to use it for benchmarking. Just implement the `qsprpred.data.sources.data_source.DataSource` interface. A data source can also cache data sets so that they are not downloaded and recreated every time you run your benchmark. You only need to make sure that the data it returns contains the molecules and the values of the `TargetProperty` you want to use for benchmarking.
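
The caching idea can be sketched in plain Python. This is only an illustration of the pattern, not the QSPRpred interface: `CachedSource`, `get_data`, `_build` and the two made-up rows are hypothetical and do not come from the library or from the tutorial data.

```python
import csv
import os
import tempfile


class CachedSource:
    """Hypothetical stand-in; it does not subclass qsprpred's DataSource."""

    def __init__(self, name, store_dir):
        self.name = name
        self.store_dir = store_dir
        self.n_builds = 0  # counts how often the expensive step actually ran

    def _build(self):
        # Placeholder for a download or an expensive preparation step.
        # The rows are made up for illustration only.
        self.n_builds += 1
        return [
            {"SMILES": "CCO", "pchembl_value_Mean": "6.1"},
            {"SMILES": "c1ccccc1", "pchembl_value_Mean": "7.3"},
        ]

    def get_data(self):
        path = os.path.join(self.store_dir, f"{self.name}.tsv")
        if not os.path.exists(path):
            # First call: build the data once and write it to the cache file.
            os.makedirs(self.store_dir, exist_ok=True)
            rows = self._build()
            with open(path, "w", newline="") as handle:
                writer = csv.DictWriter(
                    handle, fieldnames=list(rows[0]), delimiter="\t"
                )
                writer.writeheader()
                writer.writerows(rows)
        # Every call (including the first) reads the cached file.
        with open(path, newline="") as handle:
            return list(csv.DictReader(handle, delimiter="\t"))


cache_dir = tempfile.mkdtemp()
source_demo = CachedSource("MyData", cache_dir)
first = source_demo.get_data()   # builds and caches
second = source_demo.get_data()  # served from the cache file
print(source_demo.n_builds)  # 1
```

In a real `DataSource` subclass, the same check would sit inside `getData`, and the method would return a `MoleculeTable` rather than a list of dictionaries.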

The next step is to define the benchmark settings. The `BenchmarkSettings` class specifies all details of your benchmark:

In [3]:
from sklearn.naive_bayes import GaussianNB
from qsprpred.models import SklearnModel, TestSetAssessor
from qsprpred.data.processing.feature_filters import LowVarianceFilter
from sklearn.preprocessing import StandardScaler
from qsprpred.data import RandomSplit
from qsprpred import TargetProperty, TargetTasks
from qsprpred.data.descriptors.sets import FingerprintSet, RDKitDescs
from qsprpred.benchmarks import BenchmarkSettings, DataPrepSettings

settings = BenchmarkSettings(
    name="TutorialBenchmark",
    n_replicas=1,
    # number of repeated experiments for statistical power, just 1 here for faster calculation
    random_seed=SEED,  # random seed for all random operations
    data_sources=[
        # one or more data sources to use for benchmarking
        source
    ],
    descriptors=[
        # two descriptor sets are compared here; a third, combined set is commented out
        # [
        #     FingerprintSet(
        #         fingerprint_type="MorganFP", radius=2, nBits=256
        #     ),
        #     RDKitDescs()
        # ],
        [
            FingerprintSet(
                fingerprint_type="MorganFP", radius=2, nBits=256
            )
        ],
        [
            RDKitDescs()
        ]
    ],
    target_props=[
        # one or more properties to model
        [
            TargetProperty.fromDict(
                {
                    "name": "pchembl_value_Mean",
                    "task": TargetTasks.SINGLECLASS,
                    "th": [6.5]
                }
            )
        ],
    ],
    prep_settings=[
        # only the random split is active here; uncomment the second entry to compare it with a cluster split that uses the same feature filters and standardization
        DataPrepSettings(
            split=RandomSplit(test_fraction=0.2),  # random split
            feature_filters=[LowVarianceFilter(0.05)],
            feature_standardizer=StandardScaler()
        ),
        # DataPrepSettings(
        #     split=ClusterSplit(test_fraction=0.2),  # cluster split (requires importing ClusterSplit)
        #     feature_filters=[LowVarianceFilter(0.05)],
        #     feature_standardizer=StandardScaler()
        # ),
    ],
    models=[
        SklearnModel(
            name="GaussianNB",
            alg=GaussianNB,
            base_dir=f"{BASE_DIR}/models",
        ),
        # SklearnModel(
        #     name="XGBClassifier",
        #     alg=XGBClassifier,
        #     base_dir=f"{BASE_DIR}/models",
        # )
    ],
    assessors=[
        TestSetAssessor(scoring="roc_auc"),
        TestSetAssessor(scoring="matthews_corrcoef", use_proba=False),
    ],
    optimizers=[],
    # no optimizers at this time (this feature actually needs some tweaking still because the parameter grid depends on the model)
)

No random state supplied, and could not find random state on the dataset.
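
The `th` entry of the target property turns the continuous `pchembl_value_Mean` values into two classes. A minimal sketch of that conversion follows, using the first five values from the table shown earlier. Whether a value exactly equal to the threshold counts as active is an assumption here (a strict `>` is used); the tutorial does not say.

```python
# pchembl_value_Mean values copied from the first rows of the table above
pchembl_values = [5.770, 6.640, 7.880, 6.940, 7.010]
threshold = 6.5  # the single entry of "th" in the target property

# Assumption: values strictly above the threshold are labelled active (True).
labels = [value > threshold for value in pchembl_values]

n_active = sum(labels)
n_inactive = len(labels) - n_active
print(labels)                # [False, True, True, True, True]
print(n_active, n_inactive)  # 4 1
```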


In [4]:
# settings can be saved and reloaded from a file
BenchmarkSettings.fromFile(settings.toFile(f"{BASE_DIR}/settings.json"))

BenchmarkSettings(name='TutorialBenchmark', n_replicas=1, random_seed=42, data_sources=[<__main__.DataSourceTesting object at 0x7f8a4e492f20>], descriptors=[[<qsprpred.data.descriptors.sets.FingerprintSet object at 0x7f8a4e4932e0>], [<qsprpred.data.descriptors.sets.RDKitDescs object at 0x7f8a4e490730>]], target_props=[[TargetProperty(name=pchembl_value_Mean, task=SINGLECLASS, th=[6.5])]], prep_settings=[DataPrepSettings(split=<qsprpred.data.sampling.splits.RandomSplit object at 0x7f8a4e493a90>, smiles_standardizer='chembl', feature_filters=[<qsprpred.data.processing.feature_filters.LowVarianceFilter object at 0x7f8a4e490700>], feature_standardizer=StandardScaler(), feature_fill_value=0.0)], models=[<qsprpred.models.sklearn.SklearnModel object at 0x7f8a4e4933a0>], assessors=[<qsprpred.models.assessment_methods.TestSetAssessor object at 0x7f8a4e493010>, <qsprpred.models.assessment_methods.TestSetAssessor object at 0x7f8a4e492da0>], optimizers=[])

In [5]:
# to run the benchmark we need a runner
from qsprpred.benchmarks import BenchmarkRunner

runner = BenchmarkRunner(settings, data_dir=BASE_DIR)

In [6]:
# we can check how many experiments we will run
runner.nRuns

2
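
A count of 2 is what you would expect if the runner creates one workflow for every combination of the lists in the settings, repeated `n_replicas` times. The sketch below reproduces that arithmetic with plain strings standing in for the real objects. The enumeration scheme is an assumption, and the assessors are left out because they score a finished run rather than define a new one.

```python
from itertools import product

# Plain-string stand-ins for the objects passed to BenchmarkSettings above.
n_replicas = 1
data_sources = ["TutorialBenchmarkData"]
descriptors = [["MorganFP"], ["RDKitDescs"]]
target_props = [["pchembl_value_Mean"]]
prep_settings = ["RandomSplit"]
models = ["GaussianNB"]

# Assumption: one workflow per combination of the lists, per replica index.
workflows = list(
    product(
        range(n_replicas),
        data_sources,
        descriptors,
        target_props,
        prep_settings,
        models,
    )
)

n_runs = len(workflows)
print(n_runs)  # 2
for _, source_name, descs, props, prep, model in workflows:
    print(source_name, descs, props, prep, model)
```

Uncommenting the third descriptor combination or the second `DataPrepSettings` entry would multiply the count accordingly.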

`runner.nRuns` gives the number of full modeling workflows we will run. We can check the details of each one by iterating over the replicas:

In [7]:
for replica in runner.iterReplicas():
    # replicas contain all info needed to run an experiment
    print(replica)
    print(replica.dataSource.name)
    print(replica.descriptors)
    print(replica.targetProps)
    print(replica.prepSettings.split)
    break

TutorialBenchmark_2746317213
TutorialBenchmarkData
[<qsprpred.data.descriptors.sets.FingerprintSet object at 0x7f8a4e492d10>]
[TargetProperty(name=pchembl_value_Mean, task=SINGLECLASS, th=[6.5])]
<qsprpred.data.sampling.splits.RandomSplit object at 0x7f89c2ea42b0>


Replicas have methods to run a full experiment, but we will let the runner handle this for us and distribute the calculations over the available CPUs:

In [8]:
import shutil

if os.path.exists(BASE_DIR):
    # start fresh (not needed if you want to reuse results from the previous run)
    shutil.rmtree(BASE_DIR, ignore_errors=True)

# this will start the actual benchmarking
runner.run(raise_errors=True)

INFO:qsprpred:Target property converted to classification.
INFO:qsprpred:Dataset 'TutorialBenchmarkData' created for target targetProperties: '[TargetProperty(name=pchembl_value_Mean_class, task=SINGLECLASS, th=[6.5])]'.
INFO:qsprpred:Target property converted to classification.
INFO:qsprpred:Dataset 'TutorialBenchmarkData' created for target targetProperties: '[TargetProperty(name=pchembl_value_Mean_class, task=SINGLECLASS, th=[6.5])]'.
INFO:qsprpred:Total: train: 80 test: 20
INFO:qsprpred:Target property: pchembl_value_Mean_class
INFO:qsprpred:    In train: active: 47 not active: 33
INFO:qsprpred:    In test:  active: 11 not active: 9

INFO:qsprpred:number of columns dropped low variance filter: 42
INFO:qsprpred:number of columns left: 214
INFO:qsprpred:Selected features: ['Descriptor_FingerprintSet_MorganFP_0', 'Descriptor_FingerprintSet_MorganFP_1', 'Descriptor_FingerprintSet_MorganFP_3', 'Descriptor_FingerprintSet_MorganFP_4', 'Descriptor_FingerprintSet_MorganFP_5', 'Descriptor_Fi

FileNotFoundError: [Errno 2] No such file or directory: '../../tutorial_output/benchmarking/data/TutorialBenchmarkData/TutorialBenchmarkData_FingerprintSet_MorganFP/TutorialBenchmarkData_FingerprintSet_MorganFP_Descriptor/TutorialBenchmarkData_FingerprintSet_MorganFP_Descriptor_meta.json'

Note on reusing results: Reuse results only when the settings of the runner have not changed. If they have changed, you risk reusing irrelevant results, because the same ID can be generated for a replica that defines a different experiment under the new settings.
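
To see how such a clash can arise, consider a hypothetical ID scheme in which each replica receives the next number from a seeded random stream. This scheme is an assumption made for illustration; the tutorial does not describe how IDs are actually generated, and `replica_ids` is not a QSPRpred function.

```python
import random


def replica_ids(benchmark_name, seed, experiments):
    """Hypothetical scheme: one seeded random number per experiment, in order."""
    rng = random.Random(seed)
    return {
        f"{benchmark_name}_{rng.randint(0, 2**32 - 1)}": experiment
        for experiment in experiments
    }


# Same benchmark name and seed, but the experiments are listed in a different order.
old = replica_ids(
    "TutorialBenchmark", 42, ["MorganFP + GaussianNB", "RDKitDescs + GaussianNB"]
)
new = replica_ids(
    "TutorialBenchmark", 42, ["RDKitDescs + GaussianNB", "MorganFP + GaussianNB"]
)

# IDs present in both runs that now point to a different experiment
shared_ids = old.keys() & new.keys()
clashes = {rid for rid in shared_ids if old[rid] != new[rid]}
print(len(shared_ids), len(clashes))
```

Under this scheme both IDs are shared, yet each one now describes a different experiment, so cached results keyed by ID would be silently mismatched.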

Once the runner has finished, we can load the results of our benchmark as a simple data frame:

In [None]:
import pandas as pd

df_results = pd.read_table(runner.resultsFile)
df_results.shape

The number of rows is twice the number of runs because each run is scored by two assessors:

In [None]:
2 * runner.nRuns

We can check the results for the first replica:

In [None]:
df_results.iloc[0, :]

We can see that for this replica we were testing the `GaussianNB` model with the `pchembl_value_Mean` property and the `RandomSplit` strategy. The `roc_auc` score is stored in the `Score` column. We can always recreate the replica from the saved file:

In [None]:
from qsprpred.benchmarks import Replica

replica = Replica.fromFile(df_results.iloc[0, :].ReplicaFile)
replica

We could now recreate the experiment:

In [None]:
# reinitialize data
replica.initData()

In [None]:
# reinitialize descriptors (this fetches them from the cache instead of recalculating them)
replica.addDescriptors()

In [None]:
replica.ds.getDescriptors()

In [None]:
replica.prepData()