# Model assessment

To assess the performance of QSPR models, `QSPRpred` provides two classes
in the [`assessment.methods`](https://cddleiden.github.io/QSPRpred/docs/api/qsprpred.models.html#module-qsprpred.models.assessment.methods) module:
* `CrossValAssessor`: performs cross-validation on a dataset
* `TestSetAssessor`: performs predictions on a test set

In this notebook, we will demonstrate how to use these classes.
Let's start by loading the data and creating the model we want to assess.

In [1]:
import os

from IPython.display import display

from qsprpred.data import QSPRDataset, RandomSplit
from qsprpred.data.descriptors.fingerprints import MorganFP

os.makedirs("../../tutorial_output/data", exist_ok=True)

dataset = QSPRDataset.fromTableFile(
    filename="../../tutorial_data/A2A_LIGANDS.tsv",
    store_dir="../../tutorial_output/data",
    name="AssessmentTutorialDataset",
    target_props=[{"name": "pchembl_value_Mean", "task": "REGRESSION"}],
    random_state=42
)

display(dataset.getDF())

# calculate compound features and split dataset into train and test
dataset.prepareDataset(
    split=RandomSplit(test_fraction=0.2, dataset=dataset),
    feature_calculators=[MorganFP(radius=3, nBits=2048)],
    recalculate_features=True,
)

dataset.getDF().head()

from qsprpred.models import SklearnModel
from sklearn.neighbors import KNeighborsRegressor

os.makedirs("../../tutorial_output/models", exist_ok=True)

# This is a scikit-learn model, so we will initialize it with the SklearnModel class
model = SklearnModel(
    base_dir="../../tutorial_output/models",
    alg=KNeighborsRegressor,
    name="AssessmentTutorialModel",
)

| QSPRID (index) | SMILES | pchembl_value_Mean | Year | QSPRID | pchembl_value_Mean_original |
|---|---|---|---|---|---|
| AssessmentTutorialDataset_0000 | `Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...` | 8.68 | 2008.0 | AssessmentTutorialDataset_0000 | 8.68 |
| AssessmentTutorialDataset_0001 | `Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...` | 4.82 | 2010.0 | AssessmentTutorialDataset_0001 | 4.82 |
| AssessmentTutorialDataset_0002 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | 5.65 | 2009.0 | AssessmentTutorialDataset_0002 | 5.65 |
| AssessmentTutorialDataset_0003 | `CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...` | 5.45 | 2009.0 | AssessmentTutorialDataset_0003 | 5.45 |
| AssessmentTutorialDataset_0004 | `CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...` | 5.20 | 2019.0 | AssessmentTutorialDataset_0004 | 5.20 |
| ... | ... | ... | ... | ... | ... |
| AssessmentTutorialDataset_4077 | `CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12` | 7.09 | 2018.0 | AssessmentTutorialDataset_4077 | 7.09 |
| AssessmentTutorialDataset_4078 | `Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1` | 8.22 | 2008.0 | AssessmentTutorialDataset_4078 | 8.22 |
| AssessmentTutorialDataset_4079 | `Nc1nc(Nc2ccc(F)cc2)nc(CSc2nnc(N)s2)n1` | 4.89 | 2010.0 | AssessmentTutorialDataset_4079 | 4.89 |
| AssessmentTutorialDataset_4080 | `CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1` | 6.51 | 2013.0 | AssessmentTutorialDataset_4080 | 6.51 |


## Cross-validation
Now we will assess our model using cross-validation. We will use the `CrossValAssessor` class for this.
A `CrossValAssessor` object is called with a model and a dataset. The training set is split into a number of folds. In each round, the model is trained on all folds but one, and its performance is assessed on the held-out fold that was not used for training.
The results are stored in a file `{model_name}_cv.tsv`.

Note: if the dataset was not split into a train and a test set, the whole dataset is used for cross-validation.
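The fold logic described above can be sketched with plain scikit-learn. This is only an illustration of the general k-fold procedure, not QSPRpred's internal implementation; the synthetic data and the names `X_demo`, `y_demo`, `knn` and `fold_scores` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for a training set (hypothetical, not the A2A data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=0.1, random_state=42
)

fold_scores = []
splitter = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in splitter.split(X_demo):
    # Train a fresh model on all folds except the held-out one
    knn = KNeighborsRegressor()
    knn.fit(X_demo[train_idx], y_demo[train_idx])
    # Score the model on the held-out fold only
    preds = knn.predict(X_demo[val_idx])
    fold_scores.append(r2_score(y_demo[val_idx], preds))

fold_scores = np.array(fold_scores)
print(fold_scores)  # one R2 value per fold
```

Each entry of `fold_scores` corresponds to one held-out fold, which is why the assessor returns an array with one score per fold.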

To score the performance of the model, we need to provide a scoring function to the `CrossValAssessor`.
In most examples throughout these tutorials, we will pass a string to use one of the predefined [scikit-learn scoring functions](https://scikit-learn.org/stable/modules/model_evaluation.html).



In [2]:
from qsprpred.models import CrossValAssessor

# Create a CrossValAssessor object and run it on the model and dataset
CrossValAssessor("r2")(model, dataset)



array([0.63378212, 0.64599353, 0.67213109, 0.65261028, 0.57733319])

Internally, the `ModelAssessor` wraps the scikit-learn scoring functions in the `SklearnMetrics` class, which converts QSPRpred model predictions to the format expected by scikit-learn scoring functions.
You can also explicitly pass a `SklearnMetrics` object to the `CrossValAssessor` and initialize it with a string, as shown below:

In [3]:
from qsprpred.models import CrossValAssessor
from qsprpred.models import SklearnMetrics

score_func = SklearnMetrics("r2")

# Create a CrossValAssessor object and run it on the model and dataset
CrossValAssessor(score_func)(model, dataset)

array([0.63378212, 0.64599353, 0.67213109, 0.65261028, 0.57733319])

Alternatively, initialize the `SklearnMetrics` object with a scikit-learn scoring function, as shown below:

In [4]:
from sklearn.metrics import make_scorer, r2_score

score_func = SklearnMetrics(make_scorer(r2_score))

CrossValAssessor(score_func)(model, dataset)

array([0.63378212, 0.64599353, 0.67213109, 0.65261028, 0.57733319])
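The three calls above return identical scores because the string `"r2"` and `make_scorer(r2_score)` refer to the same underlying metric. A minimal sketch of that equivalence in plain scikit-learn follows; it uses `sklearn.metrics.get_scorer` to look up a scorer by name and does not show how `SklearnMetrics` resolves strings internally. The synthetic data and variable names are assumptions made for this example.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import get_scorer, make_scorer, r2_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data for illustration only (hypothetical)
X_demo, y_demo = make_regression(n_samples=100, n_features=5, random_state=0)
knn = KNeighborsRegressor().fit(X_demo, y_demo)

# Look up the predefined scorer by its string name
scorer_from_name = get_scorer("r2")
# Build an equivalent scorer from the metric function itself
scorer_from_func = make_scorer(r2_score)

# Scorers are called as scorer(estimator, X, y)
score_a = scorer_from_name(knn, X_demo, y_demo)
score_b = scorer_from_func(knn, X_demo, y_demo)
print(score_a, score_b)
```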

Furthermore, we can specify the splitting strategy. 
By default, the dataset is split into 5 folds using a shuffle split.
You can find more information on how to split the data in the [data splitting tutorial](../data/data_splitting.ipynb).


In [5]:
from sklearn.model_selection import KFold

split = KFold(n_splits=10, shuffle=True, random_state=dataset.randomState)
CrossValAssessor("r2", split=split)(model, dataset)

array([0.63570264, 0.66704785, 0.63166829, 0.66808678, 0.61509333,
       0.70523338, 0.62394155, 0.68179023, 0.54670711, 0.59884993])
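Since the assessor returns one score per fold, it is often convenient to summarize them as a mean and a standard deviation. The sketch below does this with NumPy for the two arrays printed above; the values are copied from the outputs in this notebook, and the variable names are chosen for this example only.

```python
import numpy as np

# Per-fold R2 scores copied from the outputs above
scores_5fold = np.array(
    [0.63378212, 0.64599353, 0.67213109, 0.65261028, 0.57733319]
)
scores_10fold = np.array(
    [0.63570264, 0.66704785, 0.63166829, 0.66808678, 0.61509333,
     0.70523338, 0.62394155, 0.68179023, 0.54670711, 0.59884993]
)

for label, scores in [("5-fold", scores_5fold), ("10-fold", scores_10fold)]:
    mean = scores.mean()
    # ddof=1 gives the sample standard deviation across folds
    std = scores.std(ddof=1)
    print(f"{label}: mean R2 = {mean:.3f}, std = {std:.3f}")
```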

## Test set validation
To assess the performance of the model on the specified test set, we can use the `TestSetAssessor` class.
It works similarly to the `CrossValAssessor`, but instead of splitting the dataset into folds, it uses the complete training set to train the model, and then uses the associated test set to assess the performance of the model.
It therefore does not require a splitting strategy, but it does require a test set to have been defined in the dataset by performing an initial split during dataset preparation.
The results are stored in a file `{model_name}.ind.tsv` ("ind" from independent test set).

In [6]:
from qsprpred.models import TestSetAssessor

TestSetAssessor("r2")(model, dataset)

array([0.6306765])
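Conceptually, test set assessment amounts to a single fit on the training set followed by a single evaluation on the held-out test set, which is why the result contains one score. A minimal sketch in plain scikit-learn is shown here; it is not QSPRpred's implementation, and the synthetic data and variable names are assumptions made for this example.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data for illustration only (hypothetical)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=0.1, random_state=42
)

# Hold out 20% as an independent test set, mirroring test_fraction=0.2
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Train once on the complete training set
knn = KNeighborsRegressor().fit(X_train, y_train)

# Evaluate once on the test set, giving a single score
test_r2 = r2_score(y_test, knn.predict(X_test))
print(test_r2)
```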

## More information
These two classes are subclasses of the `ModelAssessor` class, which contains the shared functionality.
To customize the behavior of the assessors, you can create a subclass of the `ModelAssessor` class.
Furthermore, the `ModelAssessor` objects can be used to assess the model performance
in hyperparameter optimization, as is shown in the [hyperparameter optimization tutorial](../../advanced/modelling/hyperparameter_optimization.ipynb).