# Hyperparameter Optimization

In this tutorial we will give examples of how to use the hyperparameter optimization functionality of QSPRpred.

QSPRpred has a base class `HyperparameterOptimization` which defines the basic functionality of hyperparameter optimization. 
All subclasses of this class must implement the `optimize` method, which takes a `QSPRModel` and a dataset as input and returns the best hyperparameters for the model. The optimization classes also require a `ModelAssessment` object as input, which is used to assess the performance of the model; see the [model assessment tutorial](../../basics/modelling/model_assessment.ipynb) for more information.
They additionally take a score aggregation function, which combines the scores returned by the assessment classes (e.g. one score per fold) into a single score.
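
To make the role of the score aggregation function concrete, here is a minimal sketch that is independent of QSPRpred. The fold scores are made-up numbers chosen for illustration, not results from this tutorial. It shows how the choice of aggregation function changes the single score an optimizer would compare between hyperparameter settings:

```python
from statistics import mean, median

# Hypothetical r2 scores from a 5-fold cross-validation (illustrative values only);
# one fold performs much worse than the others.
fold_scores = [0.61, 0.58, 0.64, 0.12, 0.60]

# Each aggregation function reduces the per-fold scores to a single number.
mean_score = mean(fold_scores)      # pulled down by the outlier fold
median_score = median(fold_scores)  # robust to the outlier fold

print(f"mean:   {mean_score:.2f}")
print(f"median: {median_score:.2f}")
```

The median is less sensitive to a single poorly performing fold, which is one possible reason to prefer it over the mean when the folds are small.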

There are currently two hyperparameter optimization classes implemented in QSPRpred, `GridSearchOptimization` and `OptunaOptimization`.
In this tutorial we will show you how to use both of these classes to optimize the hyperparameters of a `KNeighborsRegressor` model. Their full documentation is available [here](https://cddleiden.github.io/QSPRpred/docs/api/qsprpred.models.html#module-qsprpred.models.hyperparam_optimization).

Let's first load the dataset and create the model to optimize.

In [1]:
import os

from IPython.display import display

from qsprpred.data import QSPRDataset, RandomSplit
from qsprpred.data.descriptors.fingerprints import MorganFP

os.makedirs("../../tutorial_output/data", exist_ok=True)

dataset = QSPRDataset.fromTableFile(
    filename='../../tutorial_data/A2A_LIGANDS.tsv',
    store_dir="../../tutorial_output/data",
    name="HyperparamOptTutorialTutorialDataset",
    target_props=[{"name": "pchembl_value_Mean", "task": "REGRESSION"}],
    random_state=42
)
dataset.sample(500)

display(dataset.getDF())

# calculate compound features and split dataset into train and test
dataset.prepareDataset(
    split=RandomSplit(test_fraction=0.2, dataset=dataset),
    feature_calculators=[MorganFP(radius=3, nBits=2048)],
    recalculate_features=True,
)

dataset.getDF().head()

from qsprpred.models import SklearnModel
from sklearn.neighbors import KNeighborsRegressor

os.makedirs("../../tutorial_output/models", exist_ok=True)

# This is an SKlearn model, so we will initialize it with the SklearnModel class
model = SklearnModel(
    base_dir='../../tutorial_output/models',
    alg=KNeighborsRegressor,
    name='HyperparamOptTutorialModel',
)



| QSPRID (index) | SMILES | pchembl_value_Mean | Year | QSPRID | pchembl_value_Mean_original |
|---|---|---|---|---|---|
| HyperparamOptTutorialTutorialDataset_0000 | `Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...` | 8.68 | 2008.0 | HyperparamOptTutorialTutorialDataset_0000 | 8.68 |
| HyperparamOptTutorialTutorialDataset_0001 | `Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...` | 4.82 | 2010.0 | HyperparamOptTutorialTutorialDataset_0001 | 4.82 |
| HyperparamOptTutorialTutorialDataset_0002 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | 5.65 | 2009.0 | HyperparamOptTutorialTutorialDataset_0002 | 5.65 |
| HyperparamOptTutorialTutorialDataset_0003 | `CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...` | 5.45 | 2009.0 | HyperparamOptTutorialTutorialDataset_0003 | 5.45 |
| HyperparamOptTutorialTutorialDataset_0004 | `CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...` | 5.20 | 2019.0 | HyperparamOptTutorialTutorialDataset_0004 | 5.20 |
| ... | ... | ... | ... | ... | ... |
| HyperparamOptTutorialTutorialDataset_4077 | `CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12` | 7.09 | 2018.0 | HyperparamOptTutorialTutorialDataset_4077 | 7.09 |
| HyperparamOptTutorialTutorialDataset_4078 | `Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1` | 8.22 | 2008.0 | HyperparamOptTutorialTutorialDataset_4078 | 8.22 |
| HyperparamOptTutorialTutorialDataset_4079 | `Nc1nc(Nc2ccc(F)cc2)nc(CSc2nnc(N)s2)n1` | 4.89 | 2010.0 | HyperparamOptTutorialTutorialDataset_4079 | 4.89 |
| HyperparamOptTutorialTutorialDataset_4080 | `CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1` | 6.51 | 2013.0 | HyperparamOptTutorialTutorialDataset_4080 | 6.51 |


This will initialize the model with default hyperparameters. In this case, the `parameters` property will be empty:

In [2]:
model.parameters

However, we can also check the full set of the estimator's hyperparameters (note that how you access them depends on the underlying model implementation; the approach shown here applies to scikit-learn models):

In [3]:
model.estimator

In [4]:
model.estimator.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

After saving the model, you can also verify which parameters were set by checking the respective JSON files:

In [5]:
import json


def load_json(file):
    with open(file, 'r') as f:
        return json.load(f)


meta_file, estimator_file = model.save(save_estimator=True)
meta_file, estimator_file

('/home/sichom/projects/QSPRpred/tutorials/tutorial_output/models/HyperparamOptTutorialModel/HyperparamOptTutorialModel_meta.json',
 '/home/sichom/projects/QSPRpred/tutorials/tutorial_output/models/HyperparamOptTutorialModel/HyperparamOptTutorialModel.json')

In [6]:
load_json(meta_file)["py/state"]["parameters"]

In [7]:
load_json(estimator_file)["params"]

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

Next, we will take a look at the hyperparameter optimization classes and how to use them.

# Grid search
Grid search is a simple hyperparameter optimization method that tries every combination of hyperparameters in a grid and returns the best one. We will first specify the grid of hyperparameters to search over.

In [8]:
import numpy as np
from qsprpred.models import GridSearchOptimization, CrossValAssessor

# Define the search space
search_space = {"n_neighbors": [3, 5], "weights": ["uniform"]}

gridsearcher = GridSearchOptimization(
    param_grid=search_space,
    model_assessor=CrossValAssessor(scoring='r2'),
    score_aggregation=np.median
)
gridsearcher.optimize(model, dataset)



{'n_neighbors': 5, 'weights': 'uniform'}
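
Conceptually, a grid search of this kind can be sketched in a few lines of plain Python. The sketch below is not QSPRpred's implementation: `expand_grid`, `toy_assess` and `grid_search` are hypothetical helpers, and `toy_assess` returns made-up fold scores in place of a real cross-validation. It only illustrates the loop of enumerating every combination, aggregating the scores and keeping the best setting:

```python
from itertools import product
from statistics import median


def expand_grid(search_space):
    """Return every combination of the listed hyperparameter values."""
    names = list(search_space)
    value_lists = (search_space[name] for name in names)
    return [dict(zip(names, values)) for values in product(*value_lists)]


def toy_assess(params):
    """Hypothetical stand-in for cross-validation: returns made-up fold scores."""
    base = 0.6 - 0.02 * abs(params["n_neighbors"] - 5)
    return [base - 0.01, base, base + 0.02]


def grid_search(search_space, assess, aggregate):
    """Score every combination and keep the one with the highest aggregated score."""
    best_params, best_score = None, float("-inf")
    for params in expand_grid(search_space):
        score = aggregate(assess(params))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


search_space = {"n_neighbors": [3, 5], "weights": ["uniform"]}
best_params, best_score = grid_search(search_space, toy_assess, median)
print(best_params, round(best_score, 2))
```

Because every combination is evaluated, the cost grows with the product of the number of values per hyperparameter, so grids are usually kept small.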

The `optimize` method returns the best hyperparameters found for the model, as determined by the chosen strategy. These hyperparameters are set on the model, and all subsequent fits will use them unless specified otherwise. They are also automatically saved to our model files:

In [9]:
load_json(meta_file)["py/state"]["parameters"]

{'n_neighbors': 5, 'weights': 'uniform'}

In [10]:
load_json(estimator_file)["params"]  # only the optimized hyperparameters were changed

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

Hyperparameter optimization also resets the model, so we need to fit the model again before making predictions:

In [11]:
from sklearn.exceptions import NotFittedError

try:
    # prediction fails because the model was reset with the new hyperparameter settings
    model.predict(dataset)
except NotFittedError as e:
    print(e)

This KNeighborsRegressor instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.


In [12]:
# can make predictions after fitting
model.fitDataset(dataset)
model.predict(dataset)

array([[7.154     ],
       [8.356     ],
       [4.712     ],
       ...,
       [5.87      ],
       [4.834     ],
       [6.92066667]])

However, you can skip this step by setting the `refit_optimal` parameter to `True` in the `optimize` method. This will automatically fit the model with the best hyperparameters found:

In [13]:
search_space = {"n_neighbors": [1, 2], "weights": ["uniform"]}

gridsearcher = GridSearchOptimization(
    param_grid=search_space,
    model_assessor=CrossValAssessor(scoring='r2'),
    score_aggregation=np.median
)
gridsearcher.optimize(model, dataset, refit_optimal=True)

{'n_neighbors': 2, 'weights': 'uniform'}

In [14]:
model.predict(dataset)

array([[7.57 ],
       [8.335],
       [4.51 ],
       ...,
       [5.87 ],
       [4.98 ],
       [6.185]])

Notice the change in output values with the new hyperparameters (the grid was changed on purpose to show the difference).
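
To see why the predictions change with `n_neighbors`, consider the following simplified, one-dimensional k-nearest-neighbours regressor written in plain Python. It is an illustrative sketch with made-up data, not scikit-learn's implementation; `knn_predict` is a hypothetical helper, and its handling of exact matches is a simplification. It also previews the `weights` option (`"uniform"` versus `"distance"`):

```python
def knn_predict(train_x, train_y, query, n_neighbors, weights="uniform"):
    """Predict a value for `query` from its nearest training points (1-D toy version)."""
    # Sort training points by distance to the query and keep the closest ones.
    by_distance = sorted(zip(train_x, train_y), key=lambda point: abs(point[0] - query))
    nearest = by_distance[:n_neighbors]

    if weights == "uniform":
        # Every neighbour contributes equally.
        return sum(y for _, y in nearest) / len(nearest)

    # "distance": closer neighbours contribute more (inverse-distance weights).
    for x, y in nearest:
        if x == query:
            return y  # simplification: an exact match determines the prediction
    inverse = [1.0 / abs(x - query) for x, _ in nearest]
    weighted_sum = sum(w * y for w, (_, y) in zip(inverse, nearest))
    return weighted_sum / sum(inverse)


# Made-up training data for illustration only.
train_x = [0.0, 1.0, 2.0, 4.0]
train_y = [5.0, 6.0, 8.0, 4.0]

for k in (1, 2):
    print(k, knn_predict(train_x, train_y, query=1.4, n_neighbors=k))
print("distance-weighted:", knn_predict(train_x, train_y, 1.4, 2, weights="distance"))
```

With one neighbour the prediction copies the closest training value; with two it averages the two closest values, so different settings generally give different outputs for the same compound.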

# Optuna
In addition to the grid search, it is possible to use [Optuna](https://optuna.org/) to optimize the hyperparameters of a model. Optuna is a hyperparameter optimization framework which uses Bayesian optimization to find the best hyperparameters.

Setting up the optimization is largely the same as for grid search. However, when specifying the search space we also need to specify the type of each hyperparameter, because Optuna needs to know how to sample it.
Aside from the search space, we also need to specify the number of trials to run, which is the number of hyperparameter combinations to try. In this case, we will also use the `TestSetAssessor` to assess the performance of the model on the test set, to show the different ways of assessing the performance of the model in the optimization.
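
The reason the type matters can be sketched without Optuna. In the example below, the hypothetical helper `sample_params` reads a typed search space in the same `[type, ...]` format and draws one value per hyperparameter. It uses plain random sampling from Python's standard library as a stand-in, which is an assumption made purely for illustration; Optuna's own samplers choose new values based on the results of earlier trials:

```python
import random


def sample_params(search_space, rng):
    """Draw one hyperparameter combination from a typed search space."""
    params = {}
    for name, spec in search_space.items():
        kind = spec[0]
        if kind == "int":
            _, low, high = spec
            params[name] = rng.randint(low, high)  # both bounds are inclusive
        elif kind == "categorical":
            params[name] = rng.choice(spec[1])  # pick one of the listed options
        else:
            raise ValueError(f"Unsupported hyperparameter type in this sketch: {kind}")
    return params


search_space = {
    "n_neighbors": ["int", 1, 10],
    "weights": ["categorical", ["uniform", "distance"]],
}

rng = random.Random(42)  # fixed seed so the sketch is reproducible
trials = [sample_params(search_space, rng) for _ in range(10)]
for trial_params in trials:
    print(trial_params)
```

An integer range and a list of categories need different sampling rules, which is why the type is given as the first item of each entry.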

We will now use Optuna to make a more thorough exploration of hyperparameters for our model:


In [15]:
from qsprpred.models import OptunaOptimization, TestSetAssessor

# Note that the hyperparameter type is specified as the first item in each list
search_space = {"n_neighbors": ["int", 1, 10],
                "weights": ["categorical", ["uniform", "distance"]]}

# Optuna optimizer with the TestSetAssessor (the variable name is kept from the grid search example)
gridsearcher = OptunaOptimization(
    n_trials=10,
    param_grid=search_space,
    model_assessor=TestSetAssessor(scoring='r2'),
)
gridsearcher.optimize(model, dataset, refit_optimal=True)

[I 2024-03-25 11:04:28,398] A new study created in memory with name: no-name-9759c32c-a7b9-4ac5-b024-4b1101e9941a
[I 2024-03-25 11:04:28,512] Trial 0 finished with value: 0.6314076647146836 and parameters: {'n_neighbors': 4, 'weights': 'uniform'}. Best is trial 0 with value: 0.6314076647146836.
[I 2024-03-25 11:04:28,629] Trial 1 finished with value: 0.6244271327283943 and parameters: {'n_neighbors': 6, 'weights': 'uniform'}. Best is trial 0 with value: 0.6314076647146836.
[I 2024-03-25 11:04:28,734] Trial 2 finished with value: 0.486307378093249 and parameters: {'n_neighbors': 1, 'weights': 'uniform'}. Best is trial 0 with value: 0.6314076647146836.
[I 2024-03-25 11:04:28,851] Trial 3 finished with value: 0.6297875073289022 and parameters: {'n_neighbors': 8, 'weights': 'distance'}. Best is trial 0 with value: 0.6314076647146836.
[I 2024-03-25 11:04:28,973] Trial 4 finished with value: 0.6158780548550493 and parameters: {'n_neighbors': 9, 'weights': 'uniform'}. Best is trial 0 with val

{'n_neighbors': 5, 'weights': 'distance'}

In [16]:
model.predict(dataset)

array([[7.2       ],
       [8.07      ],
       [4.8       ],
       ...,
       [5.99714729],
       [4.95978074],
       [6.70326748]])