# Benchmarking QSPR Models and More

The `qsprpred.benchmarks` module provides tools to compare machine learning algorithms, data preparation steps, hyperparameter optimization strategies, molecular descriptors and more, on data sets that are already prepared or of your own choosing. In this tutorial, we benchmark a small selection of models and descriptors on a simple classification task using one of the data sets provided with this tutorial.

## Choosing the Data Set

The first step is to decide which data set(s) to use for your benchmark. To make things a little faster, we will write a custom data source that samples a smaller benchmarking set from our tutorial data set:

In [1]:
import os

import pandas as pd

from qsprpred.data import MoleculeTable
from qsprpred.data.sources import DataSource

BASE_DIR = "../../tutorial_output/benchmarking"  # directory to store all benchmarking results and files
os.makedirs(BASE_DIR, exist_ok=True) # make sure it exists
SEED = 42  # random seed for all random operations, you should always get the same results with the same settings


class DataSourceTesting(DataSource):
    """
    Just a simple wrapper around our tutorial data set.
    """

    def __init__(self, name: str, base_dir: str):
        self.name = name # name of the created data set
        self.baseDir = base_dir # where to save it and all its derived data sets

    def getData(self, name: str | None = None, **kwargs) -> MoleculeTable:
        """We just need to create a simple `MoleculeTable` here.
        Defining target properties is not necessary at this point
        because they will be set as part of the benchmark parametrization.
        To make things faster we will also sample only 100 molecules.
        Note that this method could also provide different data sets
        based on the `self.name` or other parameters. It also does not
        include descriptors; these are added as required, which creates
        derived data sets in `self.baseDir`.
        """
        name = name or self.name
        return MoleculeTable(
            df=pd.read_table('../../tutorial_data/A2A_LIGANDS.tsv').sample(
                100,
                random_state=SEED
            ), # source data frame
            name=name,  # name of the created data set
            store_dir=self.baseDir,
            **kwargs
        )


source = DataSourceTesting("TutorialBenchmarkData", f"{BASE_DIR}/data")
example_ds = source.getData()
example_ds.getDF()

QSPRID,SMILES,pchembl_value_Mean,Year,QSPRID
TutorialBenchmarkData_000,CCCn1c(-c2ccccc2)nc2c1ncnc2NC1CCOC1,5.770,2018.0,TutorialBenchmarkData_000
TutorialBenchmarkData_001,CCCn1c(=O)c2c([nH]c(-c3c[nH]nc3)n2)n(CCC)c1=O,6.640,2006.0,TutorialBenchmarkData_001
TutorialBenchmarkData_002,COc1cccc2c1nc(N)n1nc(CN3CCN(c4ncc(F)cc4)CC3C)nc21,7.880,2015.0,TutorialBenchmarkData_002
TutorialBenchmarkData_003,COc1cccc(CCCC(=O)Nc2nc3c(cccc3)c(=O)s2)c1,6.940,2013.0,TutorialBenchmarkData_003
TutorialBenchmarkData_004,COc1c2nc(NC(=O)c3ccc(F)cc3)sc2c(N(CCO)C(C)=O)cc1,7.010,2010.0,TutorialBenchmarkData_004
...,...,...,...,...
TutorialBenchmarkData_095,Cc1ccc2c(O)c(C(=O)NCc3ccco3)cnc2n1,7.460,2005.0,TutorialBenchmarkData_095
TutorialBenchmarkData_096,CC(O)CCc1nc(N)c2nc(-n3nccn3)n(C)c2n1,7.102,2013.0,TutorialBenchmarkData_096
TutorialBenchmarkData_097,CCN(C(C)=O)c1ccc(Cl)c2c1sc(NC(=O)c1ccc(F)cc1)n2,7.820,2010.0,TutorialBenchmarkData_097
TutorialBenchmarkData_098,CS(=O)(=O)c1ccc(Cn2ncc3c2nc(N)nc3-c2ccco2)cc1,6.470,2008.0,TutorialBenchmarkData_098


You can wrap your own data set in a `DataSource` class to use it for benchmarking: just implement the `qsprpred.data.sources.data_source.DataSource` interface and make sure the data contains the properties that you specify as the `TargetProperty` later in the workflow.

The next step is to define the benchmarking settings. Use the `BenchmarkSettings` class to specify all details of your benchmark:

In [2]:
from qsprpred.data.descriptors.fingerprints import MorganFP
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.naive_bayes import GaussianNB
from qsprpred.models import SklearnModel, TestSetAssessor
from qsprpred.data.processing.feature_filters import LowVarianceFilter
from sklearn.preprocessing import StandardScaler
from qsprpred.data import RandomSplit, ClusterSplit
from qsprpred import TargetProperty, TargetTasks
from qsprpred.data.descriptors.sets import RDKitDescs
from qsprpred.benchmarks import BenchmarkSettings, DataPrepSettings

settings = BenchmarkSettings(
    name="TutorialBenchmark",
    n_replicas=3,
    # number of repeated experiments for statistical power, just 3 here for faster calculations
    random_seed=SEED,  # random seed for all random operations
    data_sources=[
        # one or more data sources to use for benchmarking
        source # we defined this one above
    ],
    descriptors=[
        # in this case we will test 3 different combinations of descriptors
        [
            MorganFP(
                radius=2, nBits=256
            ),
            RDKitDescs()
        ],
        [
            MorganFP(
                radius=2, nBits=256
            )
        ],
        [
            RDKitDescs()
        ]
    ],
    target_props=[
        # one or more properties to model, the table from our data source must contain them
        [
            TargetProperty.fromDict(
                {
                    "name": "pchembl_value_Mean",
                    "task": TargetTasks.SINGLECLASS,
                    "th": [6.5]
                }
            )
        ],
    ],
    prep_settings=[
        # we will compare two splitting strategies, each time with the same feature filter and standardizer
        DataPrepSettings(
            split=RandomSplit(test_fraction=0.2),  # random split
            feature_filters=[LowVarianceFilter(0.05)],
            feature_standardizer=StandardScaler()
        ),
        DataPrepSettings(
            split=ClusterSplit(test_fraction=0.2),  # cluster-based split
            feature_filters=[LowVarianceFilter(0.05)],
            feature_standardizer=StandardScaler()
        ),
    ],
    models=[
        # the algorithms to benchmark, this can be any implementation of `QSPRModel`
        SklearnModel(
            name="GaussianNB",
            alg=GaussianNB,
            base_dir=f"{BASE_DIR}/models",
        ),
        SklearnModel(
            name="ExtraTreesClassifier",
            alg=ExtraTreesClassifier,
            base_dir=f"{BASE_DIR}/models",
        )
    ],
    assessors=[
        TestSetAssessor(scoring="roc_auc"),
        TestSetAssessor(scoring="matthews_corrcoef", use_proba=False),
    ]
)

These settings can be saved to a JSON file and reloaded as needed:

In [3]:
BenchmarkSettings.fromFile(settings.toFile(f"{BASE_DIR}/settings.json"))

BenchmarkSettings(name='TutorialBenchmark', n_replicas=3, random_seed=42, data_sources=[<__main__.DataSourceTesting object at 0x7fe8f2bb4b50>], descriptors=[[<qsprpred.data.descriptors.fingerprints.MorganFP object at 0x7fe8f2bb4640>, <qsprpred.data.descriptors.sets.RDKitDescs object at 0x7fe8f2bb4610>], [<qsprpred.data.descriptors.fingerprints.MorganFP object at 0x7fe8f2bb4520>], [<qsprpred.data.descriptors.sets.RDKitDescs object at 0x7fe8f2bb5810>]], target_props=[[TargetProperty(name=pchembl_value_Mean, task=SINGLECLASS, th=[6.5])]], prep_settings=[DataPrepSettings(data_filters=(<qsprpred.data.processing.data_filters.RepeatsFilter object at 0x7fe8f2bb4550>,), split=<qsprpred.data.sampling.splits.RandomSplit object at 0x7fe8f2bb5a50>, smiles_standardizer='chembl', feature_filters=[<qsprpred.data.processing.feature_filters.LowVarianceFilter object at 0x7fe8f2bb63b0>], feature_standardizer=StandardScaler(), feature_fill_value=0.0, shuffle=True), DataPrepSettings(data_filters=(<qsprpred.

The settings can then be used to instantiate a runner that handles the creation of different replicas and data sets with the appropriate descriptors:

In [4]:
from qsprpred.benchmarks import BenchmarkRunner

runner = BenchmarkRunner(
    settings, # our settings
    data_dir=BASE_DIR # where to save outputs
)

To get an idea of the computational cost, we can check how many replicas (individual runs) will be executed:

In [5]:
runner.nRuns

36
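
The count can be reproduced by hand. The sketch below assumes that the runner enumerates every combination of the listed settings once per replicate, which is consistent with the value of 36 above; the variable names and string labels are illustrative only and are not part of QSPRpred:

```python
from itertools import product

# sizes taken from the settings defined above
n_replicas = 3
data_sources = ["TutorialBenchmarkData"]
descriptors = ["MorganFP+RDKitDescs", "MorganFP", "RDKitDescs"]
target_props = ["pchembl_value_Mean"]
prep_settings = ["RandomSplit", "ClusterSplit"]
models = ["GaussianNB", "ExtraTreesClassifier"]

# assumption: one run per element of the full Cartesian product
combinations = list(
    product(
        range(n_replicas),
        data_sources,
        descriptors,
        target_props,
        prep_settings,
        models,
    )
)
n_runs = len(combinations)
print(n_runs)  # 3 * 1 * 3 * 1 * 2 * 2 = 36
```

Adding one more model or descriptor combination therefore multiplies the cost rather than adding to it, which is worth keeping in mind when designing larger benchmarks.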

This is the number of all modeling workflows that the runner will execute. We can check the details of each one by iterating over the replicas:

In [6]:
for replica in runner.iterReplicas():
    # replicas contain all info needed to run an experiment
    print(replica)
    print(replica.dataSource.name)
    print(replica.descriptors)
    print(replica.targetProps)
    print(replica.prepSettings.split)
    break # only show the first replica

TutorialBenchmark_2746317213
TutorialBenchmarkData
[<qsprpred.data.descriptors.fingerprints.MorganFP object at 0x7fe8f2bb47f0>, <qsprpred.data.descriptors.sets.RDKitDescs object at 0x7fe8f2bb4790>]
[TargetProperty(name=pchembl_value_Mean, task=SINGLECLASS, th=[6.5])]
<qsprpred.data.sampling.splits.RandomSplit object at 0x7fe8f1c3cd60>


Replicas also have methods to perform different stages of an experiment, but the runner executes them in order for us and also distributes the calculations over the available CPUs:

In [7]:
import shutil

if os.path.exists(BASE_DIR):
    # start fresh (not needed if you want to reuse results from the previous runs)
    shutil.rmtree(BASE_DIR, ignore_errors=True)

# this will start the actual benchmarking
runner.run(raise_errors=True) # the runner will not continue if an error is raised



Unnamed: 0,Assessor,ScoreFunc,Score,TargetProperty,TargetTask,ModelFile,Algorithm,AlgorithmParams,ReplicaID,DataSet,ReplicaFile
0,TestSetAssessor,roc_auc_score,0.707071,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_2746317213,TutorialBenchmarkData_MorganFP_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
1,TestSetAssessor,matthews_corrcoef,0.393939,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_2746317213,TutorialBenchmarkData_MorganFP_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
2,TestSetAssessor,roc_auc_score,0.598901,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,ExtraTreesClassifier,"{""random_state"": 478163327}",TutorialBenchmark_478163327,TutorialBenchmarkData_MorganFP_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
3,TestSetAssessor,matthews_corrcoef,-0.179106,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,ExtraTreesClassifier,"{""random_state"": 478163327}",TutorialBenchmark_478163327,TutorialBenchmarkData_MorganFP_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
4,TestSetAssessor,roc_auc_score,0.590909,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_107420369,TutorialBenchmarkData_MorganFP_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
...,...,...,...,...,...,...,...,...,...,...,...
67,TestSetAssessor,matthews_corrcoef,-0.089087,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,ExtraTreesClassifier,"{""random_state"": 1801823908}",TutorialBenchmark_1801823908,TutorialBenchmarkData_MorganFP,/home/sichom/projects/QSPRpred/tutorials/tutor...
68,TestSetAssessor,roc_auc_score,0.646465,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_2530876844,TutorialBenchmarkData_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
69,TestSetAssessor,matthews_corrcoef,0.504430,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_2530876844,TutorialBenchmarkData_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
70,TestSetAssessor,roc_auc_score,0.776042,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,ExtraTreesClassifier,"{""random_state"": 1194819984}",TutorialBenchmark_1194819984,TutorialBenchmarkData_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...


**Note on reusing results:** Reuse results only when the settings of the runner have not changed. If they have, you risk reusing irrelevant results because the same ID can be generated for a replica that defines a different experiment under the new settings. Since this tutorial is meant for experimentation, we remove the entire result directory (`BASE_DIR`) and start fresh each time. You could also leave your results in place, in which case the runner will continue from the last unfinished replica, but beware of the danger described above when settings change.
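
To see why an ID clash is possible, consider a toy ID scheme in which each experiment receives one integer drawn from a seeded random generator. This is only a sketch under that assumption (the function `assign_ids` is hypothetical and not QSPRpred code), but it mirrors the fact that the replica IDs shown above end in a random integer:

```python
import random
from itertools import product


def assign_ids(seed, descriptors, models, name="Benchmark"):
    """Hypothetical ID scheme: one seeded random integer per experiment."""
    rng = random.Random(seed)
    return {
        f"{name}_{rng.randrange(2**32)}": combo
        for combo in product(descriptors, models)
    }


# same seed, but the order of the descriptor list changed between runs
old = assign_ids(42, ["MorganFP", "RDKitDescs"], ["GaussianNB"])
new = assign_ids(42, ["RDKitDescs", "MorganFP"], ["GaussianNB"])

# the IDs are identical, yet they now point to different experiments
for replica_id in old:
    print(replica_id, old[replica_id], "->", new[replica_id])
```

With such a scheme, the ID depends only on the seed and the position of the experiment in the enumeration, so cached results keyed by ID would silently be attributed to the wrong experiment after a settings change.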
 
When the runner has finished we can load the results of our benchmark. It is a simple data frame:

In [8]:
df_results = pd.read_table(runner.resultsFile)
df_results.shape

(72, 11)

It has twice as many rows as there are replicas because each replica is scored by two test set assessors:

In [9]:
2 * runner.nRuns

72

We can inspect the first row, which holds the first result of the first replica:

In [10]:
df_results.iloc[0, :]

Assessor                                             TestSetAssessor
ScoreFunc                                              roc_auc_score
Score                                                       0.707071
TargetProperty                                    pchembl_value_Mean
TargetTask                                               SINGLECLASS
ModelFile          /home/sichom/projects/QSPRpred/tutorials/tutor...
Algorithm                                                 GaussianNB
AlgorithmParams                                                  NaN
ReplicaID                               TutorialBenchmark_2746317213
DataSet                         TutorialBenchmarkData_MorganFP_RDkit
ReplicaFile        /home/sichom/projects/QSPRpred/tutorials/tutor...
Name: 0, dtype: object

We can see that this replica tested the `GaussianNB` model on the `pchembl_value_Mean` property; as printed earlier for the same replica ID, it used the `RandomSplit` strategy. The `roc_auc_score` metric was measured here and its value is in the `Score` column. Note the `ReplicaFile` column as well, which we can always use to recreate the replica instance:

In [11]:
from qsprpred.benchmarks import Replica

replica = Replica.fromFile(df_results.iloc[0, :].ReplicaFile)
replica

<qsprpred.benchmarks.replica.Replica at 0x7fe8f1abdc00>

## Recreating Experiments

QSPRpred aims for maximum reproducibility of experiments. Therefore, each saved replica can be reloaded and its experiment rerun step by step:

In [12]:
# reinitialize data
replica.initData()

In [13]:
replica.ds.getDF()

QSPRID,SMILES,pchembl_value_Mean,Year,QSPRID,pchembl_value_Mean_original
TutorialBenchmarkData_000,CCCn1c(-c2ccccc2)nc2c1ncnc2NC1CCOC1,False,2018.0,TutorialBenchmarkData_000,5.770
TutorialBenchmarkData_001,CCCn1c(=O)c2c([nH]c(-c3c[nH]nc3)n2)n(CCC)c1=O,True,2006.0,TutorialBenchmarkData_001,6.640
TutorialBenchmarkData_002,COc1cccc2c1nc(N)n1nc(CN3CCN(c4ncc(F)cc4)CC3C)nc21,True,2015.0,TutorialBenchmarkData_002,7.880
TutorialBenchmarkData_003,COc1cccc(CCCC(=O)Nc2nc3c(cccc3)c(=O)s2)c1,True,2013.0,TutorialBenchmarkData_003,6.940
TutorialBenchmarkData_004,COc1c2nc(NC(=O)c3ccc(F)cc3)sc2c(N(CCO)C(C)=O)cc1,True,2010.0,TutorialBenchmarkData_004,7.010
...,...,...,...,...,...
TutorialBenchmarkData_095,Cc1ccc2c(O)c(C(=O)NCc3ccco3)cnc2n1,True,2005.0,TutorialBenchmarkData_095,7.460
TutorialBenchmarkData_096,CC(O)CCc1nc(N)c2nc(-n3nccn3)n(C)c2n1,True,2013.0,TutorialBenchmarkData_096,7.102
TutorialBenchmarkData_097,CCN(C(C)=O)c1ccc(Cl)c2c1sc(NC(=O)c1ccc(F)cc1)n2,True,2010.0,TutorialBenchmarkData_097,7.820
TutorialBenchmarkData_098,CS(=O)(=O)c1ccc(Cn2ncc3c2nc(N)nc3-c2ccco2)cc1,False,2008.0,TutorialBenchmarkData_098,6.470
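
The table shows that the target column now holds class labels, while the original values are kept in `pchembl_value_Mean_original`. The relationship between the two can be sketched with the values above and the threshold `6.5` from the target property definition. Whether a value exactly equal to the threshold counts as positive is not visible in this data, so the strict comparison below is an assumption:

```python
THRESHOLD = 6.5  # the "th" value from the target property definition

# original values copied from the `pchembl_value_Mean_original` column above
# (rows 000 to 004 and row 098)
original_values = [5.770, 6.640, 7.880, 6.940, 7.010, 6.470]

# assumption: values strictly above the threshold become the positive class
labels = [value > THRESHOLD for value in original_values]
print(labels)  # [False, True, True, True, True, False]
```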


In [14]:
# reinitialize descriptors
replica.addDescriptors()

In [15]:
replica.ds.getDescriptors()

QSPRID,MorganFP_0,MorganFP_1,MorganFP_2,MorganFP_3,MorganFP_4,MorganFP_5,MorganFP_6,MorganFP_7,MorganFP_8,MorganFP_9,...,fr_sulfonamd,fr_sulfone,fr_term_acetylene,fr_tetrazole,fr_thiazole,fr_thiocyan,fr_thiophene,fr_unbrch_alkane,fr_urea,qed
TutorialBenchmarkData_000,False,False,False,False,False,False,True,True,False,False,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.781396
TutorialBenchmarkData_001,False,False,False,True,False,False,False,False,False,False,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.736392
TutorialBenchmarkData_002,True,False,False,False,False,False,False,False,True,False,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.534122
TutorialBenchmarkData_003,True,False,False,False,False,False,False,False,False,False,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.733913
TutorialBenchmarkData_004,True,False,False,False,False,False,False,False,False,False,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.660116
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TutorialBenchmarkData_095,False,False,False,False,False,False,False,False,False,False,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.768685
TutorialBenchmarkData_096,False,True,False,False,False,False,False,False,False,False,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.683816
TutorialBenchmarkData_097,False,False,False,False,False,False,False,False,False,False,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.699868
TutorialBenchmarkData_098,False,False,False,False,False,False,False,True,True,False,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.585306


It should be noted that the code above does not recalculate the descriptors, but rather reloads the data set saved here:

In [16]:
replica.ds.storeDir

'../../tutorial_output/benchmarking/data/TutorialBenchmarkData_MorganFP_RDkit'

In [17]:
# reproduce data preparation
replica.prepData()

In [18]:
# this for example yields the split from before
train, test = replica.ds.getFeatures()
train.shape, test.shape

((80, 283), (20, 283))

In [19]:
# initialize model with the prepared data
replica.initModel()
replica.model.alg



sklearn.naive_bayes.GaussianNB

In [20]:
# run assessments
replica.runAssessment()

The results are saved within the replica too:

In [21]:
replica.results

Unnamed: 0,Assessor,ScoreFunc,Score,TargetProperty,TargetTask
0,TestSetAssessor,roc_auc_score,0.707071,pchembl_value_Mean,SINGLECLASS
0,TestSetAssessor,matthews_corrcoef,0.393939,pchembl_value_Mean,SINGLECLASS


Since the random seed was fixed to the original value, we should get the same results as in the original run:

In [22]:
df_results.iloc[0, :].Score, df_results.iloc[1, :].Score

(0.7070707070707071, 0.3939393939393939)
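
The agreement can also be checked programmatically. The sketch below uses only values copied from the outputs above; the absolute tolerance is our own choice to account for the six decimals displayed in the tables and is not a QSPRpred default:

```python
import math

# scores from the original benchmark run (full precision, as printed above)
original = (0.7070707070707071, 0.3939393939393939)
# scores from the reproduced replica (as displayed, rounded to six decimals)
reproduced = (0.707071, 0.393939)

# compare pairwise, allowing for the rounding of the displayed values
matches = all(
    math.isclose(o, r, abs_tol=1e-6) for o, r in zip(original, reproduced)
)
print(matches)  # True
```

When both sets of scores are available at full precision, a much tighter tolerance is appropriate.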

In order to get the complete report on the reproduced replica, the `replica.createReport()` method can be used:

In [23]:
replica.createReport()

Unnamed: 0,Assessor,ScoreFunc,Score,TargetProperty,TargetTask,ModelFile,Algorithm,AlgorithmParams,ReplicaID,DataSet,ReplicaFile
0,TestSetAssessor,roc_auc_score,0.707071,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_2746317213,TutorialBenchmarkData_MorganFP_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
0,TestSetAssessor,matthews_corrcoef,0.393939,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_2746317213,TutorialBenchmarkData_MorganFP_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...


## Analyzing Results

The resulting data frame contains concatenated reports from each replica. We can get detailed information about model parameters, data preparation settings, target properties and more. It is up to you to decide how to analyze the results. You can also override the `createReport()` method of the `Replica` class and change the `iterReplicas` of `BenchmarkRunner` to use your own implementation of various benchmarking steps.

Here we just provide a few examples of retrieving various data to plot from the report we obtained here:

In [24]:
import pandas as pd

OUT_DIR = '../../tutorial_output/benchmarking'

# reload the data manually this time to avoid recalculating everything (the above code needs to be executed at least once)
df_results = pd.read_table(f'{OUT_DIR}/results.tsv')
df_results

Unnamed: 0,Assessor,ScoreFunc,Score,TargetProperty,TargetTask,ModelFile,Algorithm,AlgorithmParams,ReplicaID,DataSet,ReplicaFile
0,TestSetAssessor,roc_auc_score,0.707071,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_2746317213,TutorialBenchmarkData_MorganFP_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
1,TestSetAssessor,matthews_corrcoef,0.393939,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_2746317213,TutorialBenchmarkData_MorganFP_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
2,TestSetAssessor,roc_auc_score,0.598901,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,ExtraTreesClassifier,"{""random_state"": 478163327}",TutorialBenchmark_478163327,TutorialBenchmarkData_MorganFP_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
3,TestSetAssessor,matthews_corrcoef,-0.179106,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,ExtraTreesClassifier,"{""random_state"": 478163327}",TutorialBenchmark_478163327,TutorialBenchmarkData_MorganFP_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
4,TestSetAssessor,roc_auc_score,0.590909,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_107420369,TutorialBenchmarkData_MorganFP_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
...,...,...,...,...,...,...,...,...,...,...,...
67,TestSetAssessor,matthews_corrcoef,-0.089087,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,ExtraTreesClassifier,"{""random_state"": 1801823908}",TutorialBenchmark_1801823908,TutorialBenchmarkData_MorganFP,/home/sichom/projects/QSPRpred/tutorials/tutor...
68,TestSetAssessor,roc_auc_score,0.646465,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_2530876844,TutorialBenchmarkData_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
69,TestSetAssessor,matthews_corrcoef,0.504430,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,GaussianNB,,TutorialBenchmark_2530876844,TutorialBenchmarkData_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...
70,TestSetAssessor,roc_auc_score,0.776042,pchembl_value_Mean,SINGLECLASS,/home/sichom/projects/QSPRpred/tutorials/tutor...,ExtraTreesClassifier,"{""random_state"": 1194819984}",TutorialBenchmark_1194819984,TutorialBenchmarkData_RDkit,/home/sichom/projects/QSPRpred/tutorials/tutor...


In [25]:
# we used different models, descriptors and splits
# probably makes sense to first get info about splits
# not in the original report, but...

from qsprpred.benchmarks import Replica


def get_split_info(replica):
    replica = Replica.fromFile(replica)
    return str(replica.prepSettings.split.__class__.__name__)


df_results['Split'] = df_results.ReplicaFile.apply(get_split_info)

In [26]:
# we have info about data set, which is ok to describe our combination of descriptors
df_results['Descriptors'] = df_results['DataSet'].apply(
    lambda x: "_".join(x.split('_')[1:]))
df_results['Descriptors']

0     MorganFP_RDkit
1     MorganFP_RDkit
2     MorganFP_RDkit
3     MorganFP_RDkit
4     MorganFP_RDkit
           ...      
67          MorganFP
68             RDkit
69             RDkit
70             RDkit
71             RDkit
Name: Descriptors, Length: 72, dtype: object
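
Note that splitting on the first underscore only works because the data source name `TutorialBenchmarkData` contains no underscore itself. A slightly more defensive variant, sketched here with a hypothetical helper `descriptor_label` and a made-up source name `My_Source`, strips the known source name instead:

```python
def descriptor_label(data_set: str, source_name: str) -> str:
    """Strip the data source name and the joining underscore from a data set name."""
    return data_set.removeprefix(f"{source_name}_")  # requires Python 3.9+


# same result as the lambda above for the tutorial data
label = descriptor_label("TutorialBenchmarkData_MorganFP_RDkit", "TutorialBenchmarkData")
print(label)  # MorganFP_RDkit

# a source name with an underscore breaks the naive split, but not the helper
naive = "_".join("My_Source_MorganFP".split("_")[1:])
robust = descriptor_label("My_Source_MorganFP", "My_Source")
print(naive, robust)  # Source_MorganFP MorganFP
```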

In [27]:
import seaborn as sns
from matplotlib import pyplot as plt


# we can now create a couple informative box plots
# check the output directory to see the figures

def make_box_plot(data, x, y, hue, plot_name="boxplot"):
    # generate one plot for each metric
    for score_func in data.ScoreFunc.unique():
        # use the subset passed in as `data`, not the global data frame
        df_ind = data.loc[data.ScoreFunc == score_func]
        # MCC ranges from -1 to 1, ROC AUC from 0 to 1
        plt.ylim([-1, 1] if score_func == "matthews_corrcoef" else [0, 1])
        plt.title(score_func)
        sns.boxplot(
            data=df_ind,
            x=x,
            y=y,
            hue=hue,
        )
        plt.savefig(f"{OUT_DIR}/{plot_name}_{score_func}_{x}_{y}_{hue}.png")
        plt.clf()
        plt.close()


# comparison of descriptors and their influence on each model's performance in cluster split
make_box_plot(
    df_results[df_results.Split == "ClusterSplit"],
    x="Algorithm",
    y="Score",
    hue="Descriptors",
    plot_name="ClusterSplit"
)
# comparison of performance for different splitting strategies for each model (cluster split clearly more difficult)
# TODO: we could follow this up with a tutorial to integrate the AVE bias (still in qsp-bench, the tools.py script)
make_box_plot(
    df_results[df_results.Algorithm == "GaussianNB"],
    x="Split",
    y="Score",
    hue="Descriptors",
    plot_name="GaussianNB"
)
make_box_plot(
    df_results[df_results.Algorithm == "ExtraTreesClassifier"],
    x="Split",
    y="Score",
    hue="Descriptors",
    plot_name="ExtraTreesClassifier"
)
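
The box plots can be complemented with a numeric summary. The sketch below groups a handful of `roc_auc_score` rows copied from the report above by algorithm and descriptor combination, using only the standard library. It illustrates the idea on these few rows and is not a summary of the full benchmark:

```python
from collections import defaultdict
from statistics import mean

# (Algorithm, Descriptors, Score) for some roc_auc_score rows shown above
rows = [
    ("GaussianNB", "MorganFP_RDkit", 0.707071),
    ("ExtraTreesClassifier", "MorganFP_RDkit", 0.598901),
    ("GaussianNB", "MorganFP_RDkit", 0.590909),
    ("GaussianNB", "RDkit", 0.646465),
    ("ExtraTreesClassifier", "RDkit", 0.776042),
]

# collect the scores of each (algorithm, descriptors) group
grouped = defaultdict(list)
for algorithm, descriptors, score in rows:
    grouped[(algorithm, descriptors)].append(score)

# average the scores within each group
summary = {key: mean(scores) for key, scores in grouped.items()}
for (algorithm, descriptors), avg in sorted(summary.items()):
    print(f"{algorithm:22s} {descriptors:15s} {avg:.3f}")
```

On the full data frame, the same kind of summary can be obtained with `df_results.groupby(["ScoreFunc", "Split", "Algorithm", "Descriptors"])["Score"].mean()`.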