# Benchmarking QSPR Models and More

The `qsprpred.benchmarks` module provides tools to benchmark QSPR models, data preparation steps, hyperparameter optimization strategies, molecular descriptors and more, either on data sets that are already prepared or on data sets of your own choosing. In this tutorial, we benchmark a small selection of models and descriptors on a simple task.

The first step is to decide which data set(s) to use for your benchmark. To make things a little faster, we will fetch our data from a small testing data source:

In [1]:
import os
import pandas as pd

from qsprpred.data import MoleculeTable
from qsprpred.data.sources import DataSource

# TODO: it would be nice to also have examples for all the data sets we integrate and how to integrate a new data set here

BASE_DIR = "../../tutorial_output/benchmarking"  # directory to store all benchmarking results and files
os.makedirs(BASE_DIR, exist_ok=True)
SEED = 42  # random seed for all random operations


class DataSourceTesting(DataSource):
    """
    Just a simple wrapper around our tutorial data set.
    """

    def __init__(self, name: str, store_dir: str):
        self.name = name
        self.storeDir = store_dir

    def getData(self, name: str | None = None, **kwargs) -> MoleculeTable:
        """We just need to fetch a simple `MoleculeTable`.
        Defining target properties is not necessary. We just need to
        make sure that the data set contains the target properties we
        want to use for benchmarking later. 
        
        To make things faster we will sample only 100 molecules each time.
        This code could also be simplified so that reloading of the file is not necessary.
        """
        name = name or self.name
        return MoleculeTable(
            df=pd.read_table('../../tutorial_data/A2A_LIGANDS.tsv').sample(100,
                                                                           random_state=SEED),
            name=name,
            store_dir=self.storeDir,
            **kwargs
        )


source = DataSourceTesting("TutorialBenchmarkData", f"{BASE_DIR}/data")
example_ds = source.getData()
example_ds.getDF()

QSPRID,SMILES,pchembl_value_Mean,Year,QSPRID
TutorialBenchmarkData_000,CCCn1c(-c2ccccc2)nc2c1ncnc2NC1CCOC1,5.770,2018.0,TutorialBenchmarkData_000
TutorialBenchmarkData_001,CCCn1c(=O)c2c([nH]c(-c3c[nH]nc3)n2)n(CCC)c1=O,6.640,2006.0,TutorialBenchmarkData_001
TutorialBenchmarkData_002,COc1cccc2c1nc(N)n1nc(CN3CCN(c4ncc(F)cc4)CC3C)nc21,7.880,2015.0,TutorialBenchmarkData_002
TutorialBenchmarkData_003,COc1cccc(CCCC(=O)Nc2nc3c(cccc3)c(=O)s2)c1,6.940,2013.0,TutorialBenchmarkData_003
TutorialBenchmarkData_004,COc1c2nc(NC(=O)c3ccc(F)cc3)sc2c(N(CCO)C(C)=O)cc1,7.010,2010.0,TutorialBenchmarkData_004
...,...,...,...,...
TutorialBenchmarkData_095,Cc1ccc2c(O)c(C(=O)NCc3ccco3)cnc2n1,7.460,2005.0,TutorialBenchmarkData_095
TutorialBenchmarkData_096,CC(O)CCc1nc(N)c2nc(-n3nccn3)n(C)c2n1,7.102,2013.0,TutorialBenchmarkData_096
TutorialBenchmarkData_097,CCN(C(C)=O)c1ccc(Cl)c2c1sc(NC(=O)c1ccc(F)cc1)n2,7.820,2010.0,TutorialBenchmarkData_097
TutorialBenchmarkData_098,CS(=O)(=O)c1ccc(Cn2ncc3c2nc(N)nc3-c2ccco2)cc1,6.470,2008.0,TutorialBenchmarkData_098


In [2]:
# this is where all data sets derived from our source will be stored
example_ds.baseDir

'../../tutorial_output/benchmarking/data'

You can wrap your own data set in a `DataSource` class to use it for benchmarking: just implement the `qsprpred.data.sources.data_source.DataSource` interface. A data source can also cache data sets so that they are not downloaded and recreated every time you run your benchmark. You only need to make sure that the data it returns contains the molecules and the `TargetProperty` values you want to use for benchmarking.
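The caching idea can be sketched independently of QSPRpred. The snippet below is a minimal illustration using only the standard library; `cached_table`, `fake_download` and the two-line table are hypothetical stand-ins, not part of the `DataSource` interface. In a real data source, the same check would typically sit inside `getData`.

```python
import os
import tempfile


def cached_table(path, fetch):
    """Return the text stored at `path`, calling `fetch()` only if the file is missing."""
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(fetch())
    with open(path, encoding="utf-8") as handle:
        return handle.read()


calls = []  # records how often the expensive step actually runs


def fake_download():
    """Stand-in for an expensive download; the content is made up for illustration."""
    calls.append(1)
    return "SMILES\tpchembl_value_Mean\nCCO\t6.5\n"


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data", "example.tsv")
    first = cached_table(target, fake_download)   # downloads and writes the file
    second = cached_table(target, fake_download)  # served from the cached file

print(len(calls))
```

The second call never reaches `fake_download`, which is the behavior you want when the same data source is queried once per replica.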

The next step is to decide on the benchmarking settings. The `BenchmarkSettings` class specifies all details of your benchmark:

In [3]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.naive_bayes import GaussianNB
from qsprpred.models import SklearnModel, TestSetAssessor
from qsprpred.data.processing.feature_filters import LowVarianceFilter
from sklearn.preprocessing import StandardScaler
from qsprpred.data import RandomSplit, ClusterSplit
from qsprpred import TargetProperty, TargetTasks
from qsprpred.data.descriptors.sets import FingerprintSet, RDKitDescs
from qsprpred.benchmarks import BenchmarkSettings, DataPrepSettings

settings = BenchmarkSettings(
    name="TutorialBenchmark",
    n_replicas=15,
    # number of repeated experiments for statistical power (lower it for faster calculation)
    random_seed=SEED,  # random seed for all random operations
    data_sources=[
        # one or more data sources to use for benchmarking
        source
    ],
    descriptors=[
        # in this case we will test 3 different combinations of descriptors
        [
            FingerprintSet(
                fingerprint_type="MorganFP", radius=2, nBits=256
            ),
            RDKitDescs()
        ],
        [
            FingerprintSet(
                fingerprint_type="MorganFP", radius=2, nBits=256
            )
        ],
        [
            RDKitDescs()
        ]
    ],
    target_props=[
        # one or more properties to model
        [
            TargetProperty.fromDict(
                {
                    "name": "pchembl_value_Mean",
                    "task": TargetTasks.SINGLECLASS,
                    "th": [6.5]
                }
            )
        ],
    ],
    prep_settings=[
        # we will compare two splitting strategies, each time with the same feature filters and standardization
        DataPrepSettings(
            split=RandomSplit(test_fraction=0.2),  # random split
            feature_filters=[LowVarianceFilter(0.05)],
            feature_standardizer=StandardScaler()
        ),
        DataPrepSettings(
            split=ClusterSplit(test_fraction=0.2),  # cluster-based split
            feature_filters=[LowVarianceFilter(0.05)],
            feature_standardizer=StandardScaler()
        ),
    ],
    models=[
        SklearnModel(
            name="GaussianNB",
            alg=GaussianNB,
            base_dir=f"{BASE_DIR}/models",
        ),
        SklearnModel(
            name="ExtraTreesClassifier",
            alg=ExtraTreesClassifier,
            base_dir=f"{BASE_DIR}/models",
        )
    ],
    assessors=[
        TestSetAssessor(scoring="roc_auc"),
        TestSetAssessor(scoring="matthews_corrcoef", use_proba=False),
    ],
    optimizers=[],
    # no optimizers at this time (this feature actually needs some tweaking still because the parameter grid depends on the model)
)

No random state supplied, and could not find random state on the dataset.


In [4]:
# settings can be saved and reloaded from a file
BenchmarkSettings.fromFile(settings.toFile(f"{BASE_DIR}/settings.json"))

BenchmarkSettings(name='TutorialBenchmark', n_replicas=15, random_seed=42, data_sources=[<__main__.DataSourceTesting object at 0x7f6cfc0346d0>], descriptors=[[<qsprpred.data.descriptors.sets.FingerprintSet object at 0x7f6cfc037160>, <qsprpred.data.descriptors.sets.RDKitDescs object at 0x7f6cfc036590>], [<qsprpred.data.descriptors.sets.FingerprintSet object at 0x7f6cfc0367d0>], [<qsprpred.data.descriptors.sets.RDKitDescs object at 0x7f6cfc036380>]], target_props=[[TargetProperty(name=pchembl_value_Mean, task=SINGLECLASS, th=[6.5])]], prep_settings=[DataPrepSettings(split=<qsprpred.data.sampling.splits.RandomSplit object at 0x7f6cfc036650>, smiles_standardizer='chembl', feature_filters=[<qsprpred.data.processing.feature_filters.LowVarianceFilter object at 0x7f6cfc036b60>], feature_standardizer=StandardScaler(), feature_fill_value=0.0), DataPrepSettings(split=<qsprpred.data.sampling.splits.ClusterSplit object at 0x7f6cfc037340>, smiles_standardizer='chembl', feature_filters=[<qsprpred.dat

In [5]:
# to run the benchmark we need a runner
from qsprpred.benchmarks import BenchmarkRunner

runner = BenchmarkRunner(settings, data_dir=BASE_DIR)

In [6]:
# we can check how many experiments we will run
runner.nRuns

180

This is the number of full modeling workflows we will run. We can check the details of each one by iterating over the replicas:

In [7]:
for replica in runner.iterReplicas():
    # replicas contain all info needed to run an experiment
    print(replica)
    print(replica.dataSource.name)
    print(replica.descriptors)
    print(replica.targetProps)
    print(replica.prepSettings.split)
    break

TutorialBenchmark_2746317213
TutorialBenchmarkData
[<qsprpred.data.descriptors.sets.FingerprintSet object at 0x7f6d557d6bf0>, <qsprpred.data.descriptors.sets.RDKitDescs object at 0x7f6d557d71c0>]
[TargetProperty(name=pchembl_value_Mean, task=SINGLECLASS, th=[6.5])]
<qsprpred.data.sampling.splits.RandomSplit object at 0x7f6cfb1934f0>


Replicas have methods to run a full experiment, but we will let the runner handle this for us and distribute the calculations over the available CPUs:

In [8]:
import shutil

if os.path.exists(BASE_DIR):
    # start fresh (not needed if you want to reuse results from the previous run)
    shutil.rmtree(BASE_DIR, ignore_errors=True)

# this will start the actual benchmarking
runner.run(raise_errors=True)

Random state supplied, but alg <class 'sklearn.naive_bayes.GaussianNB'> does not support it. Ignoring this setting.


{}

Unnamed: 0,Assessor,ScoreFunc,Score,TargetProperties,TargetTasks,ModelFile,Algorithm,AlgorithmParams,ReplicaID,DataSet,ReplicaFile
0,TestSetAssessor,roc_auc_score,0.707071,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_2746317213,TutorialBenchmarkData_FingerprintSet_MorganFP_...,/home/sichom/projects/QSPRpred/tutorials/tutor...
1,TestSetAssessor,matthews_corrcoef,0.393939,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_2746317213,TutorialBenchmarkData_FingerprintSet_MorganFP_...,/home/sichom/projects/QSPRpred/tutorials/tutor...
2,TestSetAssessor,roc_auc_score,0.772727,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,ExtraTreesClassifier,"{""random_state"": 478163327}",TutorialBenchmark_478163327,TutorialBenchmarkData_FingerprintSet_MorganFP_...,/home/sichom/projects/QSPRpred/tutorials/tutor...
3,TestSetAssessor,matthews_corrcoef,0.287213,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,ExtraTreesClassifier,"{""random_state"": 478163327}",TutorialBenchmark_478163327,TutorialBenchmarkData_FingerprintSet_MorganFP_...,/home/sichom/projects/QSPRpred/tutorials/tutor...
4,TestSetAssessor,roc_auc_score,0.757576,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_1181241943,TutorialBenchmarkData_FingerprintSet_MorganFP,/home/sichom/projects/QSPRpred/tutorials/tutor...
...,...,...,...,...,...,...,...,...,...,...,...
355,TestSetAssessor,matthews_corrcoef,0.301511,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,ExtraTreesClassifier,"{""random_state"": 3246059658}",TutorialBenchmark_3246059658,TutorialBenchmarkData_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
356,TestSetAssessor,roc_auc_score,0.755208,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_202363285,TutorialBenchmarkData_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
357,TestSetAssessor,matthews_corrcoef,-0.068041,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_202363285,TutorialBenchmarkData_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
358,TestSetAssessor,roc_auc_score,0.568182,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,ExtraTreesClassifier,"{""random_state"": 3698408854}",TutorialBenchmark_3698408854,TutorialBenchmarkData_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...


Note on reusing results: reuse results only when the settings of the runner have not changed. If they have, you risk reusing irrelevant results, because the same ID can be generated for a replica that defines a different experiment under the new settings.
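One way such a clash can arise is sketched below. It assumes that replica IDs are built from the benchmark name plus a per-replica seed drawn in sequence from the master seed. The ID suffixes in the results table match the `random_state` values in `AlgorithmParams`, which is consistent with this, but the scheme here is a hypothetical stand-in and not the runner's actual implementation. The function `replica_ids` and the simplified grids are illustrative only.

```python
import itertools
import random


def replica_ids(name, master_seed, grid):
    """Map a seed-derived ID to every combination in `grid` (hypothetical scheme)."""
    rng = random.Random(master_seed)
    ids = {}
    for combo in itertools.product(*grid.values()):
        # one seed per combination, drawn in iteration order
        ids[f"{name}_{rng.randrange(1, 2**32)}"] = combo
    return ids


old_grid = {
    "model": ["GaussianNB", "ExtraTreesClassifier"],
    "split": ["random", "cluster"],
}
# same benchmark name and seed, but the models are listed in a different order
new_grid = {
    "model": ["ExtraTreesClassifier", "GaussianNB"],
    "split": ["random", "cluster"],
}

old = replica_ids("TutorialBenchmark", 42, old_grid)
new = replica_ids("TutorialBenchmark", 42, new_grid)

# IDs that exist in both runs but describe different experiments
clashes = {
    rid: (old[rid], new[rid])
    for rid in old
    if rid in new and old[rid] != new[rid]
}
for rid, (before, after) in clashes.items():
    print(rid, before, "->", after)
```

Under this assumption, merely reordering the models makes every ID point to a different experiment, so cached results keyed by ID would be silently mismatched.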

The runner has finished and we can now load the results of our benchmark. It is a simple data frame:

In [9]:
df_results = pd.read_table(runner.resultsFile)
df_results.shape

(360, 11)

It has twice as many rows as there are runs because each run is scored by two assessors:

In [10]:
2 * runner.nRuns

360
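Both counts can be reproduced by hand from the settings. The sketch below assumes that every combination of replica, data source, descriptor set, target property, preparation setting and model forms one run; this matches the reported 180 but is not a statement about the runner's internals. The strings are plain labels standing in for the objects passed to `BenchmarkSettings`.

```python
from itertools import product

n_replicas = 15
data_sources = ["TutorialBenchmarkData"]
descriptors = ["MorganFP + RDKitDescs", "MorganFP", "RDKitDescs"]
target_props = ["pchembl_value_Mean"]
prep_settings = ["RandomSplit", "ClusterSplit"]
models = ["GaussianNB", "ExtraTreesClassifier"]
assessors = ["roc_auc", "matthews_corrcoef"]

# one run per combination of all settings that define an experiment
runs = list(
    product(
        range(n_replicas),
        data_sources,
        descriptors,
        target_props,
        prep_settings,
        models,
    )
)
n_runs = len(runs)  # 15 * 1 * 3 * 1 * 2 * 2

# every run contributes one result row per assessor
n_result_rows = n_runs * len(assessors)

print(n_runs, n_result_rows)
```

This is also a quick way to estimate the cost of a benchmark before starting it: adding one more model to the list above would add 90 runs.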

We can check the results for the first replica:

In [11]:
df_results.iloc[0, :]

Assessor                                              TestSetAssessor
ScoreFunc                                               roc_auc_score
Score                                                        0.707071
TargetProperties                                   pchembl_value_Mean
TargetTasks                                               SINGLECLASS
ModelFile           /home/sichom/projects/QSPRpred/tutorials/tutor...
Algorithm                                                  GaussianNB
AlgorithmParams                                                   NaN
ReplicaID                                TutorialBenchmark_2746317213
DataSet             TutorialBenchmarkData_FingerprintSet_MorganFP_...
ReplicaFile         /home/sichom/projects/QSPRpred/tutorials/tutor...
Name: 0, dtype: object

This replica tested the `GaussianNB` model on the `pchembl_value_Mean` property. Its ID matches the replica printed earlier, which used the `RandomSplit` strategy. The `roc_auc` score is in the `Score` column. We can always recreate the replica from the saved file:

In [12]:
from qsprpred.benchmarks import Replica

replica = Replica.fromFile(df_results.iloc[0, :].ReplicaFile)
replica

<qsprpred.benchmarks.replica.Replica at 0x7f6cfaed18d0>

# Recreating Experiments

QSPRpred aims for maximum reproducibility of experiments. Each replica can therefore be reloaded from its saved file and its experiment rerun step by step:

In [13]:
# reinitialize data
replica.initData()

In [14]:
replica.ds.getDF()

QSPRID,SMILES,pchembl_value_Mean,Year,QSPRID,pchembl_value_Mean_class
TutorialBenchmarkData_000,CCCn1c(-c2ccccc2)nc2c1ncnc2NC1CCOC1,5.770,2018.0,TutorialBenchmarkData_000,False
TutorialBenchmarkData_001,CCCn1c(=O)c2c([nH]c(-c3c[nH]nc3)n2)n(CCC)c1=O,6.640,2006.0,TutorialBenchmarkData_001,True
TutorialBenchmarkData_002,COc1cccc2c1nc(N)n1nc(CN3CCN(c4ncc(F)cc4)CC3C)nc21,7.880,2015.0,TutorialBenchmarkData_002,True
TutorialBenchmarkData_003,COc1cccc(CCCC(=O)Nc2nc3c(cccc3)c(=O)s2)c1,6.940,2013.0,TutorialBenchmarkData_003,True
TutorialBenchmarkData_004,COc1c2nc(NC(=O)c3ccc(F)cc3)sc2c(N(CCO)C(C)=O)cc1,7.010,2010.0,TutorialBenchmarkData_004,True
...,...,...,...,...,...
TutorialBenchmarkData_095,Cc1ccc2c(O)c(C(=O)NCc3ccco3)cnc2n1,7.460,2005.0,TutorialBenchmarkData_095,True
TutorialBenchmarkData_096,CC(O)CCc1nc(N)c2nc(-n3nccn3)n(C)c2n1,7.102,2013.0,TutorialBenchmarkData_096,True
TutorialBenchmarkData_097,CCN(C(C)=O)c1ccc(Cl)c2c1sc(NC(=O)c1ccc(F)cc1)n2,7.820,2010.0,TutorialBenchmarkData_097,True
TutorialBenchmarkData_098,CS(=O)(=O)c1ccc(Cn2ncc3c2nc(N)nc3-c2ccco2)cc1,6.470,2008.0,TutorialBenchmarkData_098,False
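The new `pchembl_value_Mean_class` column holds the binarized target defined by `"th": [6.5]` in the settings. The check below reuses the rows displayed above and assumes that values above the threshold are labelled `True`. None of the displayed values equals 6.5 exactly, so the table does not show how the boundary itself is handled; the helper `binarize` is illustrative, not a QSPRpred function.

```python
THRESHOLD = 6.5  # from "th": [6.5] in the target property definition

# (pchembl_value_Mean, pchembl_value_Mean_class) pairs copied from the table above
shown = {
    "TutorialBenchmarkData_000": (5.770, False),
    "TutorialBenchmarkData_001": (6.640, True),
    "TutorialBenchmarkData_002": (7.880, True),
    "TutorialBenchmarkData_003": (6.940, True),
    "TutorialBenchmarkData_004": (7.010, True),
    "TutorialBenchmarkData_095": (7.460, True),
    "TutorialBenchmarkData_096": (7.102, True),
    "TutorialBenchmarkData_097": (7.820, True),
    "TutorialBenchmarkData_098": (6.470, False),
}


def binarize(value, threshold=THRESHOLD):
    """Label a value as active (True) when it lies above the threshold."""
    return value > threshold


# rows whose displayed class disagrees with the assumed rule
mismatches = [
    qspr_id
    for qspr_id, (value, label) in shown.items()
    if binarize(value) != label
]
print(mismatches)
```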


In [15]:
# reinitialize descriptors
replica.addDescriptors()

In [16]:
replica.ds.getDescriptors()

QSPRID,Descriptor_FingerprintSet_MorganFP_0,Descriptor_FingerprintSet_MorganFP_1,Descriptor_FingerprintSet_MorganFP_2,Descriptor_FingerprintSet_MorganFP_3,Descriptor_FingerprintSet_MorganFP_4,Descriptor_FingerprintSet_MorganFP_5,Descriptor_FingerprintSet_MorganFP_6,Descriptor_FingerprintSet_MorganFP_7,Descriptor_FingerprintSet_MorganFP_8,Descriptor_FingerprintSet_MorganFP_9,...,Descriptor_RDkit_fr_sulfonamd,Descriptor_RDkit_fr_sulfone,Descriptor_RDkit_fr_term_acetylene,Descriptor_RDkit_fr_tetrazole,Descriptor_RDkit_fr_thiazole,Descriptor_RDkit_fr_thiocyan,Descriptor_RDkit_fr_thiophene,Descriptor_RDkit_fr_unbrch_alkane,Descriptor_RDkit_fr_urea,Descriptor_RDkit_qed
TutorialBenchmarkData_000,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.781396
TutorialBenchmarkData_001,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.736392
TutorialBenchmarkData_002,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.534122
TutorialBenchmarkData_003,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.733913
TutorialBenchmarkData_004,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.660116
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TutorialBenchmarkData_095,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.768685
TutorialBenchmarkData_096,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.683816
TutorialBenchmarkData_097,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.699868
TutorialBenchmarkData_098,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.585306


It should be noted that the code above does not recalculate the descriptors, but reloads the data set saved here:

In [17]:
replica.ds.storeDir

'../../tutorial_output/benchmarking/data/TutorialBenchmarkData_FingerprintSet_MorganFP_RDkit'

In [18]:
# reproduce data preparation for replica
replica.prepData()

In [19]:
# this for example yields the split from before
train, test = replica.ds.getFeatures()
train.shape, test.shape

((80, 283), (20, 283))

In [20]:
# initialize model with the prepared data
replica.initModel()

Random state supplied, but alg <class 'sklearn.naive_bayes.GaussianNB'> does not support it. Ignoring this setting.


In [21]:
replica.model.alg

sklearn.naive_bayes.GaussianNB

In [22]:
# we can also run all the original assessments with this model
replica.runAssessment()

The results are saved within the replica too:

In [23]:
replica.results

Unnamed: 0,Assessor,ScoreFunc,Score,TargetProperties,TargetTasks
0,TestSetAssessor,roc_auc_score,0.707071,pchembl_value_Mean,SINGLECLASS
0,TestSetAssessor,matthews_corrcoef,0.393939,pchembl_value_Mean,SINGLECLASS


Since the random seed was fixed to the original value, we should get the same results as in the original run:

In [24]:
df_results.iloc[0, :].Score

0.7070707070707071

In [25]:
df_results.iloc[1, :].Score

0.3939393939393939
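The comparison can also be automated instead of being read off by eye. The sketch below uses the numbers shown above in place of the live objects: `original` holds the scores from the results file, and `rerun` holds the scores as displayed for the rerun replica, rounded to six decimals. The helper `scores_match` is hypothetical, and the tolerance is chosen only to absorb that display rounding.

```python
import math

# scores from the original run (full precision, as printed above)
original = {
    "roc_auc_score": 0.7070707070707071,
    "matthews_corrcoef": 0.3939393939393939,
}
# scores of the rerun replica as displayed (rounded to six decimals)
rerun = {
    "roc_auc_score": 0.707071,
    "matthews_corrcoef": 0.393939,
}


def scores_match(a, b, abs_tol=1e-6):
    """True when both runs report the same metrics with equal scores within `abs_tol`."""
    return a.keys() == b.keys() and all(
        math.isclose(a[name], b[name], abs_tol=abs_tol) for name in a
    )


print(scores_match(original, rerun))
```

When both sides are available at full precision, a much tighter tolerance is appropriate, since a fixed seed should reproduce the scores exactly.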

In order to get the complete report on the replica results, the `replica.createReport()` method can be used:

In [26]:
replica.createReport()

Unnamed: 0,Assessor,ScoreFunc,Score,TargetProperties,TargetTasks,ModelFile,Algorithm,AlgorithmParams,ReplicaID,DataSet,ReplicaFile
0,TestSetAssessor,roc_auc_score,0.707071,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_2746317213,TutorialBenchmarkData_FingerprintSet_MorganFP_...,/home/sichom/projects/QSPRpred/tutorials/tutor...
0,TestSetAssessor,matthews_corrcoef,0.393939,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_2746317213,TutorialBenchmarkData_FingerprintSet_MorganFP_...,/home/sichom/projects/QSPRpred/tutorials/tutor...


# Analyzing Results

The resulting data frame contains the concatenated reports from all replicas. It gives detailed information about model parameters, data preparation settings, target properties and more, and it is up to you to decide how to analyze it. You can also override the `createReport()` method of the `Replica` class and the `iterReplicas` method of `BenchmarkRunner` so that the runner uses your implementation.
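A typical first step is to average the scores per algorithm and metric. The sketch below shows the mechanics with the standard library only, using the nine rows displayed in this tutorial. These rows mix descriptor sets and splits, so the resulting numbers illustrate the aggregation and should not be read as conclusions about the models.

```python
from collections import defaultdict
from statistics import mean

# (Algorithm, ScoreFunc, Score) triples copied from the rows displayed above
rows = [
    ("GaussianNB", "roc_auc_score", 0.707071),
    ("GaussianNB", "matthews_corrcoef", 0.393939),
    ("ExtraTreesClassifier", "roc_auc_score", 0.772727),
    ("ExtraTreesClassifier", "matthews_corrcoef", 0.287213),
    ("GaussianNB", "roc_auc_score", 0.757576),
    ("ExtraTreesClassifier", "matthews_corrcoef", 0.301511),
    ("GaussianNB", "roc_auc_score", 0.755208),
    ("GaussianNB", "matthews_corrcoef", -0.068041),
    ("ExtraTreesClassifier", "roc_auc_score", 0.568182),
]

# collect all scores that share the same algorithm and metric
grouped = defaultdict(list)
for algorithm, score_func, score in rows:
    grouped[(algorithm, score_func)].append(score)

# number of rows and mean score per group
summary = {key: (len(scores), mean(scores)) for key, scores in grouped.items()}

for (algorithm, score_func), (n, avg) in sorted(summary.items()):
    print(f"{algorithm:22s} {score_func:18s} n={n} mean={avg:.3f}")
```

On the full results frame, the same aggregation can be written with pandas as `df_results.groupby(["Algorithm", "ScoreFunc"])["Score"].mean()`.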

Here we just provide a few examples of retrieving data from the data frame we generated before:

In [27]:
import pandas as pd

OUT_DIR = '../../tutorial_output/benchmarking'

# reload the data manually this time to avoid recalculating everything (the above code needs to be executed at least once)
df_results = pd.read_table(f'{OUT_DIR}/results.tsv')
df_results 

Unnamed: 0,Assessor,ScoreFunc,Score,TargetProperties,TargetTasks,ModelFile,Algorithm,AlgorithmParams,ReplicaID,DataSet,ReplicaFile
0,TestSetAssessor,roc_auc_score,0.707071,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_2746317213,TutorialBenchmarkData_FingerprintSet_MorganFP_...,/home/sichom/projects/QSPRpred/tutorials/tutor...
1,TestSetAssessor,matthews_corrcoef,0.393939,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_2746317213,TutorialBenchmarkData_FingerprintSet_MorganFP_...,/home/sichom/projects/QSPRpred/tutorials/tutor...
2,TestSetAssessor,roc_auc_score,0.772727,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,ExtraTreesClassifier,"{""random_state"": 478163327}",TutorialBenchmark_478163327,TutorialBenchmarkData_FingerprintSet_MorganFP_...,/home/sichom/projects/QSPRpred/tutorials/tutor...
3,TestSetAssessor,matthews_corrcoef,0.287213,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,ExtraTreesClassifier,"{""random_state"": 478163327}",TutorialBenchmark_478163327,TutorialBenchmarkData_FingerprintSet_MorganFP_...,/home/sichom/projects/QSPRpred/tutorials/tutor...
4,TestSetAssessor,roc_auc_score,0.757576,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_1181241943,TutorialBenchmarkData_FingerprintSet_MorganFP,/home/sichom/projects/QSPRpred/tutorials/tutor...
...,...,...,...,...,...,...,...,...,...,...,...
355,TestSetAssessor,matthews_corrcoef,0.301511,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,ExtraTreesClassifier,"{""random_state"": 3246059658}",TutorialBenchmark_3246059658,TutorialBenchmarkData_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
356,TestSetAssessor,roc_auc_score,0.755208,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_202363285,TutorialBenchmarkData_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
357,TestSetAssessor,matthews_corrcoef,-0.068041,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_202363285,TutorialBenchmarkData_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
358,TestSetAssessor,roc_auc_score,0.568182,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,ExtraTreesClassifier,"{""random_state"": 3698408854}",TutorialBenchmark_3698408854,TutorialBenchmarkData_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...


In [28]:
# The benchmark combined different models, descriptors and splits.
# The report does not store the splitting strategy directly,
# so we recover it from the saved replica file of each row.

from qsprpred.benchmarks import Replica

def get_split_info(replica_file):
    # load the serialized replica and return the class name of its split
    replica = Replica.fromFile(replica_file)
    return replica.prepSettings.split.__class__.__name__

df_results['Split'] = df_results.ReplicaFile.apply(get_split_info)

In [29]:
# the data set name encodes the descriptor combination, so we drop the leading data set prefix to get a descriptor label
df_results['Descriptors'] = df_results['DataSet'].apply(lambda x: "_".join(x.split('_')[1:]))
df_results['Descriptors']

0      FingerprintSet_MorganFP_RDkit
1      FingerprintSet_MorganFP_RDkit
2      FingerprintSet_MorganFP_RDkit
3      FingerprintSet_MorganFP_RDkit
4            FingerprintSet_MorganFP
                   ...              
355                            RDkit
356                            RDkit
357                            RDkit
358                            RDkit
359                            RDkit
Name: Descriptors, Length: 360, dtype: object
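
The label extraction above relies on a naming pattern: the data set name starts with a prefix, followed by the descriptor names, all joined by underscores. The following is a minimal sketch of that parsing step on two data set names taken from the table above. The helper name `descriptor_label` is hypothetical and only used for illustration; it assumes the prefix itself contains no underscore.

```python
def descriptor_label(data_set_name: str) -> str:
    """Drop the first underscore-separated token (the data set prefix)
    and keep the rest as the descriptor label.

    Assumption: the prefix itself contains no underscore.
    """
    return "_".join(data_set_name.split("_")[1:])


# two complete data set names as they appear in the results table
names = [
    "TutorialBenchmarkData_FingerprintSet_MorganFP",
    "TutorialBenchmarkData_RDkit",
]
labels = [descriptor_label(name) for name in names]
print(labels)
```

If your own data set names contain underscores in the prefix, this simple split would leak part of the prefix into the label, so a more explicit mapping may be the safer choice.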

In [30]:
import seaborn as sns
from matplotlib import pyplot as plt

# we can now create a couple informative box plots
# check the output directory to see the figures

def make_box_plot(data, x, y, hue, plot_name="boxplot"):
    # generate one plot for each metric found in the supplied data
    for score_func in data.ScoreFunc.unique():
        # use the (possibly filtered) `data` argument, not the global data frame
        df_ind = data.loc[(data.ScoreFunc == score_func)]
        # fixed y-axis for comparability; negative matthews_corrcoef values fall below this range
        plt.ylim([0, 1])
        plt.title(score_func)
        sns.boxplot(
            data=df_ind,
            x=x,
            y=y,
            hue=hue,
            palette=sns.color_palette('bright')
        )
        plt.savefig(f"{OUT_DIR}/{plot_name}_{score_func}_{x}_{y}_{hue}.png")
        plt.clf()
        plt.close()

# comparison of descriptors and their influence on each model's performance in cluster split
make_box_plot(
    df_results[df_results.Split == "ClusterSplit"],
    x="Algorithm",
    y="Score",
    hue="Descriptors",
    plot_name="ClusterSplit"
)
# comparison of performance for different splitting strategies for each model (cluster split clearly more difficult)
# TODO: we could follow this up with a tutorial to integrate the AVE bias (still in qsp-bench, the tools.py script)
make_box_plot(
    df_results[df_results.Algorithm == "GaussianNB"],
    x="Split",
    y="Score",
    hue="Descriptors",
    plot_name="GaussianNB"
)
make_box_plot(
    df_results[df_results.Algorithm == "ExtraTreesClassifier"],
    x="Split",
    y="Score",
    hue="Descriptors",
    plot_name="ExtraTreesClassifier"
)


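Box plots give a visual impression, but a small numeric summary is often a useful companion. The sketch below shows one possible way to aggregate scores per metric and algorithm with a pandas `groupby`. The frame `toy` and all of its values are hypothetical and only mimic the `Algorithm`, `ScoreFunc` and `Score` columns of the report; they are not results of the benchmark run.

```python
import pandas as pd

# Hypothetical scores for illustration only (not taken from the benchmark).
toy = pd.DataFrame({
    "Algorithm": ["GaussianNB", "GaussianNB",
                  "ExtraTreesClassifier", "ExtraTreesClassifier"] * 2,
    "ScoreFunc": ["roc_auc_score"] * 4 + ["matthews_corrcoef"] * 4,
    "Score": [0.70, 0.80, 0.75, 0.85, 0.30, 0.40, 0.20, 0.40],
})

# mean, standard deviation and number of replicas per metric and algorithm
summary = (
    toy.groupby(["ScoreFunc", "Algorithm"])["Score"]
    .agg(["mean", "std", "count"])
    .reset_index()
)
print(summary)
```

The same call should work on `df_results` once the `Split` and `Descriptors` columns exist; adding them to the `groupby` keys would give one row per combination of split, descriptor set and algorithm.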