# Hyperparameter Optimization

In this tutorial we will give examples of how to use the hyperparameter optimization functionality of QSPRpred.

QSPRpred has a base class, `HyperparameterOptimization`, which defines the basic functionality for hyperparameter optimization.
All subclasses of this class must implement the `optimize` method, which takes a `QSPRModel` as input and returns the best hyperparameters found for that model. The optimization classes also require a `ModelAssessment` object, which is used to assess the performance of the model (see the [model assessment tutorial](../../basics/modelling/model_assessment.ipynb) for more information),
as well as a score aggregation function, which combines the scores returned by the assessor (e.g. one score per fold) into a single score.
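
The division of labour described above (an optimizer, an assessor that returns several scores, and a function that aggregates them) can be sketched in plain Python. This is only an illustration of the pattern, not QSPRpred's implementation: `ToyOptimizer`, `BestOfList`, and `toy_assessor` are hypothetical names, and the fold scores are made up.

```python
from abc import ABC, abstractmethod
from statistics import mean, median


class ToyOptimizer(ABC):
    """Hypothetical stand-in for a hyperparameter optimization base class."""

    def __init__(self, assessor, score_aggregation=mean):
        self.assessor = assessor  # returns a list of scores, e.g. one per fold
        self.score_aggregation = score_aggregation  # collapses them to one number

    def score(self, params):
        """Assess one hyperparameter combination and aggregate its scores."""
        return self.score_aggregation(self.assessor(params))

    @abstractmethod
    def optimize(self, candidates):
        """Return the best hyperparameter combination."""


class BestOfList(ToyOptimizer):
    """Simplest possible subclass: score every candidate, keep the best."""

    def optimize(self, candidates):
        return max(candidates, key=self.score)


def toy_assessor(params):
    # Made-up scores for three folds; a real assessor would fit and evaluate a model.
    base = 1.0 - abs(params["n_neighbors"] - 5) / 10
    return [base - 0.05, base, base + 0.02]


candidates = [{"n_neighbors": k} for k in (1, 3, 5, 7)]
best = BestOfList(toy_assessor, score_aggregation=median).optimize(candidates)
print(best)
```

Swapping `median` for `mean` changes only how the per-fold scores are combined; the search itself is untouched, which is the reason the aggregation function is passed in separately.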

There are currently two hyperparameter optimization classes implemented in QSPRpred, `GridSearchOptimization` and `OptunaOptimization`.
In this tutorial we will show you how to use both of these classes to optimize the hyperparameters of a `KNeighborsRegressor` model. You can find their documentation [here](https://cddleiden.github.io/QSPRpred/docs/api/qsprpred.models.html#module-qsprpred.models.hyperparam_optimization).

Let's first load the dataset and create the model to optimize.

In [1]:
import os
from IPython.display import display
from qsprpred.data.data import QSPRDataset
from qsprpred.data.utils.descriptorsets import FingerprintSet
from qsprpred.data.utils.descriptorcalculator import MoleculeDescriptorsCalculator
from qsprpred.data.utils.datasplitters import RandomSplit

os.makedirs("../../tutorial_output/data", exist_ok=True)

dataset = QSPRDataset.fromTableFile(
    filename="../../tutorial_data/A2A_LIGANDS.tsv",
    store_dir="../../tutorial_output/data",
    name="HyperparamOptTutorialTutorialDataset",
    target_props=[{"name": "pchembl_value_Mean", "task": "REGRESSION"}],
    random_state=42,
)

display(dataset.getDF())

# Set up a calculator for Morgan fingerprint features (radius 3, 2048 bits)
feature_calculator = MoleculeDescriptorsCalculator(
    desc_sets=[FingerprintSet(fingerprint_type="MorganFP", radius=3, nBits=2048)]
)

# calculate compound features and split dataset into train and test
dataset.prepareDataset(
    split=RandomSplit(test_fraction=0.2, dataset=dataset),
    feature_calculators=[feature_calculator],
    recalculate_features=True,
)

dataset.getDF().head()

from qsprpred.models.sklearn import SklearnModel
from sklearn.neighbors import KNeighborsRegressor

os.makedirs("../../tutorial_output/models", exist_ok=True)

# This is an SKlearn model, so we will initialize it with the SklearnModel class
model = SklearnModel(
    base_dir = '../../tutorial_output/models',
    data = dataset,
    alg = KNeighborsRegressor,
    name = 'HyperparamOptTutorialModel',
)

| QSPRID | SMILES | pchembl_value_Mean | Year | QSPRID |
|---|---|---|---|---|
| HyperparamOptTutorialTutorialDataset_0 | `Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...` | 8.68 | 2008.0 | HyperparamOptTutorialTutorialDataset_0 |
| HyperparamOptTutorialTutorialDataset_1 | `Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...` | 4.82 | 2010.0 | HyperparamOptTutorialTutorialDataset_1 |
| HyperparamOptTutorialTutorialDataset_2 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | 5.65 | 2009.0 | HyperparamOptTutorialTutorialDataset_2 |
| HyperparamOptTutorialTutorialDataset_3 | `CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...` | 5.45 | 2009.0 | HyperparamOptTutorialTutorialDataset_3 |
| HyperparamOptTutorialTutorialDataset_4 | `CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...` | 5.20 | 2019.0 | HyperparamOptTutorialTutorialDataset_4 |
| ... | ... | ... | ... | ... |
| HyperparamOptTutorialTutorialDataset_4077 | `CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12` | 7.09 | 2018.0 | HyperparamOptTutorialTutorialDataset_4077 |
| HyperparamOptTutorialTutorialDataset_4078 | `Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1` | 8.22 | 2008.0 | HyperparamOptTutorialTutorialDataset_4078 |
| HyperparamOptTutorialTutorialDataset_4079 | `Nc1nc(Nc2ccc(F)cc2)nc(CSc2nnc(N)s2)n1` | 4.89 | 2010.0 | HyperparamOptTutorialTutorialDataset_4079 |
| HyperparamOptTutorialTutorialDataset_4080 | `CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1` | 6.51 | 2013.0 | HyperparamOptTutorialTutorialDataset_4080 |


# Grid search
Grid search is a simple hyperparameter optimization method which tries every combination of hyperparameter values in a grid and returns the best-scoring combination. We will first specify the grid of hyperparameters to search over.

In [8]:
import numpy as np
from qsprpred.models.hyperparam_optimization import GridSearchOptimization
from qsprpred.models.assessment_methods import CrossValAssessor
from qsprpred.models.metrics import SklearnMetric

# Define the search space
search_space = {"n_neighbors": [3, 5], "weights": ["uniform", "distance"]}

gridsearcher = GridSearchOptimization(
    param_grid=search_space,
    model_assessor=CrossValAssessor(scoring=SklearnMetric.getDefaultMetric(model.task)),
    score_aggregation=np.median
)
gridsearcher.optimize(model)

{'n_neighbors': 5, 'weights': 'distance'}
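
To make concrete what "all combinations in a grid" means, the sketch below expands the same `search_space` into its four candidate dictionaries and picks the one with the highest median score. It is a standalone illustration, not the code `GridSearchOptimization` runs: `expand_grid` and `toy_fold_scores` are hypothetical helpers, and the scores are invented rather than produced by cross-validation.

```python
from itertools import product
from statistics import median

search_space = {"n_neighbors": [3, 5], "weights": ["uniform", "distance"]}


def expand_grid(space):
    """Return one dict per combination of values in the grid."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*(space[n] for n in names))]


def toy_fold_scores(params):
    """Made-up per-fold scores; a real assessor would train and evaluate a model."""
    base = 0.5 + 0.02 * params["n_neighbors"]
    bonus = 0.03 if params["weights"] == "distance" else 0.0
    return [base + bonus - 0.01, base + bonus, base + bonus + 0.01]


candidates = expand_grid(search_space)
# Aggregate the fold scores with the median, then keep the best combination.
best = max(candidates, key=lambda p: median(toy_fold_scores(p)))

for params in candidates:
    print(params)
print("best:", best)
```

The number of candidates is the product of the list lengths (here 2 × 2 = 4), so the cost of a grid search grows quickly as hyperparameters or values are added.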

# Optuna
In addition to grid search, it is possible to use [Optuna](https://optuna.org/) to optimize the hyperparameters of a model. Optuna is a hyperparameter optimization framework which uses Bayesian optimization to find the best hyperparameters.

Setting up the optimization is largely the same as for grid search. However, when specifying the search space, we also need to specify the type of each hyperparameter, because Optuna needs to know how to sample it.
Aside from the search space, we also need to specify the number of trials to run, which is the number of different hyperparameter combinations to try. In this case, we will use the `TestSetAssessor` to assess the performance of the model on the test set, to show a different way of assessing model performance during the optimization.


In [7]:
import numpy as np
from qsprpred.models.hyperparam_optimization import OptunaOptimization
from qsprpred.models.assessment_methods import TestSetAssessor
from qsprpred.models.metrics import SklearnMetric

# Note the specification of the hyperparameter types as first item in the list
search_space = {"n_neighbors": ["int", 1, 10], "weights": ["categorical", ["uniform", "distance"]]}

# Optuna optimizer with the TestSetAssessor
optimizer = OptunaOptimization(
    n_trials=10,
    param_grid=search_space,
    model_assessor=TestSetAssessor(scoring=SklearnMetric.getDefaultMetric(model.task)),
)
optimizer.optimize(model)

[I 2023-11-10 16:39:43,797] A new study created in memory with name: no-name-f353bb31-0f2e-4383-9d6f-373686825ab2
[I 2023-11-10 16:39:43,983] Trial 0 finished with value: 0.6355860742620096 and parameters: {'n_neighbors': 4, 'weights': 'uniform'}. Best is trial 0 with value: 0.6355860742620096.
[I 2023-11-10 16:39:44,174] Trial 1 finished with value: 0.6291165241252481 and parameters: {'n_neighbors': 6, 'weights': 'uniform'}. Best is trial 0 with value: 0.6355860742620096.
[I 2023-11-10 16:39:44,362] Trial 2 finished with value: 0.491664653614183 and parameters: {'n_neighbors': 1, 'weights': 'uniform'}. Best is trial 0 with value: 0.6355860742620096.
[I 2023-11-10 16:39:44,613] Trial 3 finished with value: 0.6374320800581863 and parameters: {'n_neighbors': 8, 'weights': 'distance'}. Best is trial 3 with value: 0.6374320800581863.
[I 2023-11-10 16:39:44,809] Trial 4 finished with value: 0.6237251938045083 and parameters: {'n_neighbors': 9, 'weights': 'uniform'}. Best is trial 3 with val

{'n_neighbors': 5, 'weights': 'distance'}
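
The reason the typed search space needs the type as its first item can be seen in a small sampler. The sketch below draws hyperparameter combinations from the same `search_space` using plain uniform random sampling from the standard library. This is deliberately a simplification: it is not Optuna's sampling strategy and not QSPRpred's code, `sample_params` is a hypothetical helper, and only the two types shown in this tutorial (`"int"` and `"categorical"`) are handled, with integer bounds assumed to be inclusive.

```python
import random

search_space = {
    "n_neighbors": ["int", 1, 10],
    "weights": ["categorical", ["uniform", "distance"]],
}


def sample_params(space, rng):
    """Draw one hyperparameter combination from a typed search space."""
    params = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "int":
            _, low, high = spec
            params[name] = rng.randint(low, high)  # both bounds inclusive (assumption)
        elif kind == "categorical":
            params[name] = rng.choice(spec[1])
        else:
            # Other types are outside the scope of this sketch.
            raise ValueError(f"Unsupported hyperparameter type: {kind!r}")
    return params


rng = random.Random(42)  # seeded so the draws are reproducible
trials = [sample_params(search_space, rng) for _ in range(10)]
for i, params in enumerate(trials):
    print(f"trial {i}: {params}")
```

Unlike the grid search, the number of trials is fixed in advance and independent of the size of the search space, which is why `n_trials` is a separate argument.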