# Model assessment

To assess the performance of the QSPR models, there are two methods available in `QSPRpred`
from the [`assessment_methods`](https://cddleiden.github.io/QSPRpred/docs/api/qsprpred.models.html#module-qsprpred.models.assessment_methods) module:
* `CrossValAssessor`: performs cross-validation on a dataset
* `TestSetAssessor`: performs predictions on a test set

In this notebook, we will demonstrate how to use these methods.
Let's start by loading the data and creating the model we want to assess.

In [10]:
import os
import pandas as pd
from IPython.display import display
from qsprpred.data.data import QSPRDataset
from qsprpred.data.utils.descriptorsets import FingerprintSet
from qsprpred.data.utils.descriptorcalculator import MoleculeDescriptorsCalculator
from qsprpred.data.utils.datasplitters import RandomSplit

df = pd.read_csv('../../tutorial_data/A2A_LIGANDS.tsv', sep='\t')

os.makedirs("../../tutorial_output/data", exist_ok=True)

dataset = QSPRDataset(
    df=df,
    store_dir="../../tutorial_output/data",
    name="A2A_LIGANDS",
    target_props=[{"name": "pchembl_value_Mean", "task": "REGRESSION"}],
    random_state=42,
)

display(dataset.getDF())


# Calculate Morgan fingerprints (radius 3, 2048 bits) as features
feature_calculator = MoleculeDescriptorsCalculator(
    desc_sets=[FingerprintSet(fingerprint_type="MorganFP", radius=3, nBits=2048)]
)

# Do a random split to create the train (80%) and test (20%) sets
rand_split = RandomSplit(test_fraction=0.2, dataset=dataset)

# calculate compound features and split dataset into train and test
dataset.prepareDataset(
    split=rand_split,
    feature_calculators=[feature_calculator],
    recalculate_features=True,
)

dataset.getDF().head()

from qsprpred.models.sklearn import SklearnModel
from sklearn.neighbors import KNeighborsRegressor

os.makedirs("tutorial_output/models", exist_ok=True)

# This is a scikit-learn model, so we initialize it with the SklearnModel class
model = SklearnModel(
    base_dir="tutorial_output/models",
    data=dataset,
    alg=KNeighborsRegressor,
    name="KNN_REG",
)

| QSPRID (index) | SMILES | pchembl_value_Mean | Year | QSPRID |
|---|---|---|---|---|
| A2A_LIGANDS_0 | `Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...` | 8.68 | 2008.0 | A2A_LIGANDS_0 |
| A2A_LIGANDS_1 | `Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...` | 4.82 | 2010.0 | A2A_LIGANDS_1 |
| A2A_LIGANDS_2 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | 5.65 | 2009.0 | A2A_LIGANDS_2 |
| A2A_LIGANDS_3 | `CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...` | 5.45 | 2009.0 | A2A_LIGANDS_3 |
| A2A_LIGANDS_4 | `CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...` | 5.20 | 2019.0 | A2A_LIGANDS_4 |
| ... | ... | ... | ... | ... |
| A2A_LIGANDS_4077 | `CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12` | 7.09 | 2018.0 | A2A_LIGANDS_4077 |
| A2A_LIGANDS_4078 | `Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1` | 8.22 | 2008.0 | A2A_LIGANDS_4078 |
| A2A_LIGANDS_4079 | `Nc1nc(Nc2ccc(F)cc2)nc(CSc2nnc(N)s2)n1` | 4.89 | 2010.0 | A2A_LIGANDS_4079 |
| A2A_LIGANDS_4080 | `CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1` | 6.51 | 2013.0 | A2A_LIGANDS_4080 |


Now we will assess our model using cross-validation with the `CrossValAssessor` class.
A `CrossValAssessor` is created with a scoring function and then called on a model, which holds the dataset. The dataset is split into a number of folds, and in each round the model is trained on all folds but one.
The performance of the model is then assessed on the fold that was held out of training,
and the predictions for each fold are stored in a file `{model_name}.cv.tsv`.
To score the performance of the model, we need to provide a scoring function. This function should take the true values and the predicted values as input, and return a score. Calling the assessor returns a list with one score per fold.
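To make the procedure concrete, here is a minimal sketch of a cross-validation loop written directly with scikit-learn. It is not how `CrossValAssessor` is implemented; it only illustrates the idea of training on all folds but one and scoring the held-out fold. The synthetic data and the names `score_func` and `fold_scores` are assumptions made for this example.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data, used only for illustration (not the A2A data set)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.1 * rng.normal(size=200)


def score_func(y_true, y_pred):
    """A scoring function: true values and predictions in, one score out."""
    return r2_score(y_true, y_pred)


fold_scores = []
splitter = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in splitter.split(X):
    # Train a fresh estimator on all folds except the held-out one
    estimator = KNeighborsRegressor()
    estimator.fit(X[train_idx], y[train_idx])
    # Score the predictions on the held-out fold
    y_pred = estimator.predict(X[test_idx])
    fold_scores.append(score_func(y[test_idx], y_pred))

print(fold_scores)  # one score per fold
```

Any function with the same shape (true values and predictions in, a single number out) could take the place of `r2_score` in this sketch.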

In [11]:
from qsprpred.models.assessment_methods import CrossValAssessor
from qsprpred.models.metrics import SklearnMetric

score_func = SklearnMetric.getMetric(name='r2')

# Create a CrossValAssessor and call it on the model; this returns one score per fold
CrossValAssessor(score_func)(model)


[0.6381005757810776,
 0.6470413968481483,
 0.6722435248029048,
 0.6588247607859833,
 0.5749937649418198]

In [14]:
pd.read_csv('tutorial_output/models/KNN_REG/KNN_REG.cv.tsv', sep='\t')

| | QSPRID | pchembl_value_Mean_Label | pchembl_value_Mean_Prediction | Fold |
|---|---|---|---|---|
| 0 | A2A_LIGANDS_221 | 7.20 | 7.774 | 0.0 |
| 1 | A2A_LIGANDS_3260 | 6.02 | 5.858 | 0.0 |
| 2 | A2A_LIGANDS_3675 | 8.32 | 7.764 | 0.0 |
| 3 | A2A_LIGANDS_3484 | 6.77 | 7.824 | 0.0 |
| 4 | A2A_LIGANDS_712 | 6.76 | 6.310 | 0.0 |
| ... | ... | ... | ... | ... |
| 3260 | A2A_LIGANDS_2041 | 6.00 | 6.240 | 9.0 |
| 3261 | A2A_LIGANDS_2068 | 7.79 | 7.970 | 9.0 |
| 3262 | A2A_LIGANDS_1806 | 8.03 | 6.662 | 9.0 |
| 3263 | A2A_LIGANDS_474 | 8.70 | 8.228 | 9.0 |
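Because the file stores the label, the prediction and the fold for every compound, any metric can be recomputed per fold afterwards. The sketch below assumes the column names shown in the table and uses only the nine rows displayed there, so the resulting numbers are illustrative and do not match the real fold scores. The names `cv_excerpt` and `per_fold_r2` are made up for this example.

```python
import pandas as pd
from sklearn.metrics import r2_score

# Only the rows displayed in the table (folds 0 and 9); the full file has many more
cv_excerpt = pd.DataFrame(
    {
        "pchembl_value_Mean_Label": [7.20, 6.02, 8.32, 6.77, 6.76, 6.00, 7.79, 8.03, 8.70],
        "pchembl_value_Mean_Prediction": [
            7.774, 5.858, 7.764, 7.824, 6.310, 6.240, 7.970, 6.662, 8.228,
        ],
        "Fold": [0.0] * 5 + [9.0] * 4,
    }
)

# Group the rows by fold and score each group separately
per_fold_r2 = {
    fold: r2_score(
        group["pchembl_value_Mean_Label"],
        group["pchembl_value_Mean_Prediction"],
    )
    for fold, group in cv_excerpt.groupby("Fold")
}
print(per_fold_r2)
```

With the full table loaded via `pd.read_csv`, the same grouping would give one score per fold for whichever metric is passed in place of `r2_score`.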


Furthermore, we can specify the splitting strategy. 
By default, the dataset is split into 5 folds using a shuffle split.
You can find more information on how to split the data in the [data splitting tutorial](../data/data_splitting.ipynb).


In [13]:
from sklearn.model_selection import KFold

split = KFold(n_splits=10, shuffle=True, random_state=dataset.randomState)
CrossValAssessor(score_func, split=split)(model)

[0.6326776666707019,
 0.6679638227844381,
 0.6316441275385969,
 0.6644615008506205,
 0.6094387718918599,
 0.701911410383543,
 0.6275067773672989,
 0.6822540428310435,
 0.543230466648694,
 0.5948557924176254]
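
The per-fold scores are easier to compare once they are summarized. As a small sketch, the snippet below copies the two score lists printed in this notebook and computes their mean and sample standard deviation with the standard library. The variable names are chosen for this example only.

```python
import statistics

# R2 scores copied from the outputs above
scores_5_fold = [
    0.6381005757810776, 0.6470413968481483, 0.6722435248029048,
    0.6588247607859833, 0.5749937649418198,
]
scores_10_fold = [
    0.6326776666707019, 0.6679638227844381, 0.6316441275385969,
    0.6644615008506205, 0.6094387718918599, 0.701911410383543,
    0.6275067773672989, 0.6822540428310435, 0.543230466648694,
    0.5948557924176254,
]

summary = {}
for label, scores in [("5-fold", scores_5_fold), ("10-fold", scores_10_fold)]:
    # Mean and sample standard deviation across folds
    summary[label] = (statistics.mean(scores), statistics.stdev(scores))
    print(f"{label}: mean R2 = {summary[label][0]:.3f}, std = {summary[label][1]:.3f}")
```

The mean gives a single estimate of performance, while the spread across folds indicates how sensitive that estimate is to the particular split.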