# Hyperparameter Optimization
In this tutorial we demonstrate the following:
- Build a parameter dictionary for `hyperparameter optimization` of a random forest using `Bayesian optimization`.
- Perform the optimization process.
- Review the results.

We will use these **[AMPL](https://github.com/ATOMScience-org/AMPL)** functions here:
- [parse_params](https://ampl.readthedocs.io/en/latest/utils.html#utils.hyperparam_search_wrapper.parse_params)
- [build_search](https://ampl.readthedocs.io/en/latest/utils.html#utils.hyperparam_search_wrapper.build_search)
- [run_search](https://ampl.readthedocs.io/en/latest/utils.html#utils.hyperparam_search_wrapper.HyperOptSearch.run_search)
- [get_filesystem_perf_results](https://ampl.readthedocs.io/en/latest/pipeline.html#pipeline.compare_models.get_filesystem_perf_results)

`Hyperparameters` control the training process and the architecture of the model itself. For example, the number of trees is a hyperparameter of a random forest. In contrast, the learned parameters of a random forest are the features used at each node of each tree and the cutoff values for those features, which determine how the data is split at that node. A full discussion of hyperparameter optimization can be found on **[Wikipedia](https://en.wikipedia.org/wiki/Hyperparameter_optimization)**.

The choice of hyperparameters strongly influences model performance,
so it is important to be able to optimize them as well. **[AMPL](https://github.com/ATOMScience-org/AMPL)**
offers a variety of hyperparameter optimization methods, including
random sampling, grid search, and Bayesian optimization. Further information on **[AMPL](https://github.com/ATOMScience-org/AMPL)**'s `Bayesian optimization` can be found **[here](https://github.com/ATOMScience-org/AMPL#hyperparameter-optimization)**.
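
To make the difference between the first two methods concrete, here is a minimal sketch that runs a grid search and a random search over the same two random forest hyperparameters. It does not call AMPL; `toy_loss` is a hypothetical stand-in for a validation loss, and the ranges simply mirror the ones used later in this tutorial. Bayesian optimization differs from both in that it uses the results of earlier trials to decide which point to try next.

```python
import itertools
import random


def toy_loss(n_trees, max_depth):
    """Hypothetical stand-in for a validation loss; smallest near (300, 16)."""
    return ((n_trees - 300) / 500) ** 2 + ((max_depth - 16) / 32) ** 2


def grid_search(tree_values, depth_values):
    """Evaluate every combination on a fixed grid and return the best one."""
    candidates = itertools.product(tree_values, depth_values)
    return min(candidates, key=lambda c: toy_loss(*c))


def random_search(n_trials, seed=0):
    """Evaluate n_trials points drawn uniformly at random from the same ranges."""
    rng = random.Random(seed)
    candidates = [(rng.randint(8, 512), rng.randint(6, 32)) for _ in range(n_trials)]
    return min(candidates, key=lambda c: toy_loss(*c))


best_grid = grid_search([8, 128, 256, 384, 512], [6, 12, 18, 24, 32])
best_random = random_search(n_trials=25)
print("grid best:  ", best_grid, round(toy_loss(*best_grid), 4))
print("random best:", best_random, round(toy_loss(*best_random), 4))
```

Both searches here evaluate 25 points. The grid can only return values that were placed on the grid in advance, while random sampling can land anywhere in the range.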

## Setup directories
Define important settings such as the descriptor type and the output directories. Make sure the directories exist before training the models.

In [1]:
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=RuntimeWarning)

import os

dataset_key='dataset/SLC6A3_Ki_curated.csv'
descriptor_type = 'rdkit_raw'
model_dir = 'dataset/SLC6A3_models'
best_model_dir = 'dataset/SLC6A3_models/best_models'
split_uuid = "c35aeaab-910c-4dcf-8f9f-04b55179aa1a"


# best_model_dir is nested inside model_dir, so create any missing parent directories too
os.makedirs(f'./{best_model_dir}', exist_ok=True)
os.makedirs(f'./{model_dir}', exist_ok=True)

## Parameter dictionary settings
- `'hyperparam':'True'` This setting indicates that we are performing
a hyperparameter search instead of just training one model.
- `'previously_featurized':'True'` This tells AMPL to search for
previously generated features in `../dataset/scaled_descriptors` instead
of regenerating them on the fly.
- `'search_type':'hyperopt'` This specifies the hyperparameter
search method. Other options include grid, random, and geometric.
The specifications for each hyperparameter search method are different;
please refer to the full documentation. Here we are using the
`Bayesian optimization` method.
- `'model_type':'RF|10'` This means **[AMPL](https://github.com/ATOMScience-org/AMPL)** will run 10 trials to
find the best set of hyperparameters using random forests. In
production this number could be set to 100 or more.
- `'rfe':'uniformint|8,512'` The `Bayesian optimizer` will sample integers
uniformly between 8 and 512 to find the best number of random forest estimators.
Similarly, `rfd` stands for random forest depth and `rff` stands for
random forest features.
- `'result_dir'` This now expects two directories, separated by a comma. The first directory
will contain the best trained models while the second directory will
contain all models trained in the search.

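The optimizer minimizes a loss computed on the validation set. In the regression
run shown below, the reported `best loss` equals 1 minus `valid_r2` (for example,
0.547 for a `valid_r2` of 0.453). Classification models are optimized using area under the
receiver operating characteristic curve.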
A full list of parameters can be found on GitHub
**[here](https://github.com/ATOMScience-org/AMPL/blob/master/atomsci/ddm/docs/PARAMETERS.md)**.
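
The string formats described above can be checked before launching a long search. The sketch below is not AMPL's own parser; the helper names `parse_model_type`, `parse_uniformint`, and `split_result_dir` are hypothetical and only illustrate how the `name|value` and `low,high` conventions break apart.

```python
def parse_model_type(value):
    """'RF|10' -> ('RF', 10): model family and number of trials."""
    name, n_trials = value.split("|")
    return name, int(n_trials)


def parse_uniformint(value):
    """'uniformint|8,512' -> (8, 512): integer search bounds."""
    kind, bounds = value.split("|")
    if kind != "uniformint":
        raise ValueError(f"expected a uniformint spec, got {value!r}")
    low, high = (int(x) for x in bounds.split(","))
    if low > high:
        raise ValueError(f"lower bound exceeds upper bound in {value!r}")
    return low, high


def split_result_dir(value):
    """'best_dir,all_dir' -> ('best_dir', 'all_dir')."""
    parts = [p.strip() for p in value.split(",")]
    if len(parts) != 2:
        raise ValueError("result_dir should name exactly two directories")
    return parts[0], parts[1]


# Values copied from the settings described above
example = {
    "model_type": "RF|10",
    "rfe": "uniformint|8,512",
    "result_dir": "./dataset/SLC6A3_models/best_models,./dataset/SLC6A3_models",
}

model_name, n_trials = parse_model_type(example["model_type"])
rfe_bounds = parse_uniformint(example["rfe"])
best_dir, all_dir = split_result_dir(example["result_dir"])
print(model_name, n_trials, rfe_bounds, best_dir, all_dir)
```

A check like this catches a malformed range or a missing second directory in seconds rather than partway through a search.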

In [2]:
params = {
    "hyperparam": "True",
    "prediction_type": "regression",

    "dataset_key": dataset_key,
    "id_col": "compound_id",
    "smiles_col": "base_rdkit_smiles",
    "response_cols": "avg_pKi",

    "splitter":"scaffold",
    "split_uuid": split_uuid,
    "previously_split": "True",

    "featurizer": "computed_descriptors",
    "descriptor_type" : descriptor_type,
    "previously_featurized": "True",
    "transformers": "True",

    "search_type": "hyperopt",
    "model_type": "RF|10",
    "rfe": "uniformint|8,512",
    "rfd": "uniformint|6,32",
    "rff": "uniformint|8,200",

    "result_dir": f"./{best_model_dir},./{model_dir}"
}

In **tutorial 4** we directly imported the `parameter_parser` and `model_pipeline` modules to parse the config dict and train a single model. Here, we use `hyperparam_search_wrapper` to handle many models for us. First we build the search by creating a list of parameters to use, and then we run the search.

In [4]:
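import atomsci.ddm.utils.hyperparam_search_wrapper as hsw
import importlib

# Reload the module so that any edits made to it during this session are picked up
importlib.reload(hsw)

# Parse the parameter dictionary defined above
ampl_param = hsw.parse_params(params)
# Build the search object for the requested search_type (hyperopt here)
hs = hsw.build_search(ampl_param)
# Run all trials; models are written to the directories given in result_dir
hs.run_search()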

model_performance|train_r2|train_rms|valid_r2|valid_rms|test_r2|test_rms|model_params|model

rf_estimators: 306, rf_max_depth: 8, rf_max_feature: 185
RF model with computed_descriptors and rdkit_raw      
  0%|          | 0/10 [00:00<?, ?trial/s, best loss=?]

[11:25:06] UFFTYPER: Unrecognized charge state for atom: 10

[11:25:06] UFFTYPER: Unrecognized charge state for atom: 11

[11:29:57] UFFTYPER: Unrecognized charge state for atom: 6

[11:30:20] UFFTYPER: Unrecognized charge state for atom: 4

[11:30:20] UFFTYPER: Unrecognized charge state for atom: 4

[11:30:20] UFFTYPER: Unrecognized charge state for atom: 4

[11:30:36] UFFTYPER: Unrecognized charge state for atom: 1

2024-04-15 11:32:15,069 Featurized file already exists. Continuing:
2024-04-15 11:32:15,088 Previous dataset split restored


model_performance|0.869|0.450|0.453|0.894|0.427|0.922|306_8_185|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_449b4803-9aeb-4be0-8704-941022a95671.tar.gz

rf_estimators: 372, rf_max_depth: 24, rf_max_feature: 123                          
RF model with computed_descriptors and rdkit_raw                                   
 10%|█         | 1/10 [08:02<1:12:24, 482.74s/trial, best loss: 0.5472864835369445]

[11:33:06] UFFTYPER: Unrecognized charge state for atom: 10

[11:33:06] UFFTYPER: Unrecognized charge state for atom: 11

[11:38:03] UFFTYPER: Unrecognized charge state for atom: 6

[11:38:24] UFFTYPER: Unrecognized charge state for atom: 4

[11:38:24] UFFTYPER: Unrecognized charge state for atom: 4

[11:38:24] UFFTYPER: Unrecognized charge state for atom: 4

[11:38:38] UFFTYPER: Unrecognized charge state for atom: 1

2024-04-15 11:39:41,837 Featurized file already exists. Continuing:
2024-04-15 11:39:41,867 Previous dataset split restored


model_performance|0.951|0.276|0.486|0.867|0.437|0.914|372_24_123|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_6fb44e08-ffd6-476b-8d42-274818154474.tar.gz

rf_estimators: 22, rf_max_depth: 8, rf_max_feature: 172                            
RF model with computed_descriptors and rdkit_raw                                   
 20%|██        | 2/10 [15:35<1:01:58, 464.85s/trial, best loss: 0.5144174634350562]

[11:40:43] UFFTYPER: Unrecognized charge state for atom: 10

[11:40:43] UFFTYPER: Unrecognized charge state for atom: 11



model_performance|0.000|100.000|0.000|100.000|0.000|100.000|22_8_172|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_4db9a2a1-c04e-445a-bc63-3fae88c5cd24.tar.gz

rf_estimators: 397, rf_max_depth: 20, rf_max_feature: 171                          
RF model with computed_descriptors and rdkit_raw                                 
 30%|███       | 3/10 [18:52<39:59, 342.82s/trial, best loss: 0.5144174634350562]
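
The `model_performance|...` lines in the output above are pipe-delimited, with the first line acting as a header. The sketch below parses the three result rows shown so far and picks the one with the highest `valid_r2`. The helper names are hypothetical, and treating the row with all-zero R2 values and an RMS of 100 as a placeholder for a trial that did not complete is an assumption based only on this log.

```python
HEADER = "model_performance|train_r2|train_rms|valid_r2|valid_rms|test_r2|test_rms|model_params|model"

# Result rows copied from the log above
ROWS = [
    "model_performance|0.869|0.450|0.453|0.894|0.427|0.922|306_8_185|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_449b4803-9aeb-4be0-8704-941022a95671.tar.gz",
    "model_performance|0.951|0.276|0.486|0.867|0.437|0.914|372_24_123|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_6fb44e08-ffd6-476b-8d42-274818154474.tar.gz",
    "model_performance|0.000|100.000|0.000|100.000|0.000|100.000|22_8_172|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_4db9a2a1-c04e-445a-bc63-3fae88c5cd24.tar.gz",
]


def parse_rows(header, rows):
    """Turn pipe-delimited result rows into dictionaries keyed by the header names."""
    names = header.split("|")[1:]  # drop the leading 'model_performance' tag
    records = []
    for row in rows:
        record = dict(zip(names, row.split("|")[1:]))
        for key in names[:6]:  # the six metric columns are numeric
            record[key] = float(record[key])
        records.append(record)
    return records


def looks_like_placeholder(record):
    """Assumption: R2 of 0 together with RMS of 100 marks a trial without real scores."""
    return record["valid_r2"] == 0.0 and record["valid_rms"] == 100.0


records = parse_rows(HEADER, ROWS)
completed = [r for r in records if not looks_like_placeholder(r)]
best = max(completed, key=lambda r: r["valid_r2"])
print(best["model_params"], best["valid_r2"])
```

The `model_params` field packs the three sampled values together in the order estimators, depth, features, which matches the `rf_estimators`, `rf_max_depth`, and `rf_max_feature` lines printed before each trial.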

The top-scoring model will be saved in `dataset/SLC6A3_models/best_models` along with a CSV file
containing regression performance for all trained models.

All of the models are saved in `dataset/SLC6A3_models`. These models can be
explored using `get_filesystem_perf_results`. A full analysis of hyperparameter performance is presented in **tutorial 7**.

In [8]:
import atomsci.ddm.pipeline.compare_models as cm

result_df = cm.get_filesystem_perf_results(
    result_dir=model_dir,
    pred_type='regression'
)

# sort by validation r2 score to see top performing models
result_df = result_df.sort_values(by='best_valid_r2_score', ascending=False)
result_df[['model_uuid','model_parameters_dict','best_valid_r2_score','best_test_r2_score']].head()

Found data for 12 models under dataset/SLC6A3_models


| | model_uuid | model_parameters_dict | best_valid_r2_score | best_test_r2_score |
|---|---|---|---|---|
| 2 | 520295f8-2ac5-45d8-9433-d3a0e96ccf0c | {"rf_estimators": 170, "rf_max_depth": 15, "rf... | 0.497099 | 0.43862 |
| 5 | 37a85d2a-1c0e-48aa-bcb9-f1b0a6107f29 | {"rf_estimators": 421, "rf_max_depth": 16, "rf... | 0.496579 | 0.42124 |
| 11 | d3388e81-d151-420d-b55c-b10627d1c71e | {"rf_estimators": 500, "rf_max_depth": 14, "rf... | 0.489879 | 0.437214 |
| 0 | 8afb64d6-993e-4d8b-9072-60dcb40d2c83 | {"rf_estimators": 500, "rf_max_depth": null, "... | 0.489673 | 0.416391 |
| 6 | 3eb607ee-cf68-4eac-bc0c-92689443a278 | {"rf_estimators": 382, "rf_max_depth": 19, "rf... | 0.484862 | 0.433234 |
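
The `model_parameters_dict` column holds JSON strings, which are truncated in the display above. A minimal sketch of decoding one such string into separate fields follows. The string used here is a hypothetical, untruncated example: the first two keys appear in the table, while the `rf_max_features` key and its value are assumptions made for illustration.

```python
import json

# Hypothetical full value; only the first two keys are visible in the table above
raw_params = '{"rf_estimators": 500, "rf_max_depth": null, "rf_max_features": 32}'


def flatten_record(model_uuid, raw_json, valid_r2):
    """Merge the decoded hyperparameters with the identifying columns of one row."""
    decoded = json.loads(raw_json)  # JSON null becomes Python None
    return {"model_uuid": model_uuid, "best_valid_r2_score": valid_r2, **decoded}


record = flatten_record("8afb64d6-993e-4d8b-9072-60dcb40d2c83", raw_params, 0.489673)
for key, value in record.items():
    print(f"{key}: {value}")
```

With the real dataframe, the same decoding could be applied to every row, for example with `result_df['model_parameters_dict'].apply(json.loads)`, assuming each entry is a valid JSON string.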


### Examples for other parameters
Below are some parameters that can be used for neural networks, 
**[XGBoost](https://en.wikipedia.org/wiki/XGBoost)** models, 
fingerprint splits and **[ECFP](https://pubs.acs.org/doi/10.1021/ci100050t)** features.
Each set of parameters can be used to replace the parameters above. 
Trying them out is left as an exercise for the reader.

#### Neural Network Hyperopt Search
- `lr` This controls the learning rate. `loguniform|-13.8,-3` means the logarithm of 
the learning rate is uniformly distributed between `-13.8` and `-3`.
- `ls` This controls layer sizes. `uniformint|3|8,512` means 3 layers with sizes ranging
between 8 and 512 neurons. A good strategy is to start with fewer layers
and slowly increase the number until performance plateaus.
- `dp` This controls dropout. `uniform|3|0,0.4` means 3 dropout layers with
a dropout probability between 0 and 40%. The number of dropout layers needs to match the
number of layers specified with `ls`, and the probability should range between 0% and 50%.
- `max_epochs` This controls how long to train each model. Training for more
epochs increases runtime, but allows models more time to optimize. 

```
params = {
    "hyperparam": "True",
    "prediction_type": "regression",

    "dataset_key": dataset_key,
    "id_col": "compound_id",
    "smiles_col": "base_rdkit_smiles",
    "response_cols": "avg_pKi",

    "splitter":"scaffold",
    "split_uuid": split_uuid,
    "previously_split": "True",

    "featurizer": "computed_descriptors",
    "descriptor_type" : descriptor_type,
    "transformers": "True",

    ### Use a NN model
    "search_type": "hyperopt",
    "model_type": "NN|10",
    "lr": "loguniform|-13.8,-3",
    "ls": "uniformint|3|8,512",
    "dp": "uniform|3|0,0.4",
    "max_epochs": 100,
    ###

    "result_dir": f"./{best_model_dir},./{model_dir}"
}
```
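
The `loguniform` bounds are easier to judge on the original scale. The sketch below converts them, assuming the logarithm is the natural logarithm (as in hyperopt's `hp.loguniform`), and also checks that `ls` and `dp` ask for the same number of layers. The helper names are hypothetical and are not part of AMPL.

```python
import math


def loguniform_bounds(spec):
    """'loguniform|-13.8,-3' -> (low, high) on the original, non-log scale."""
    kind, bounds = spec.split("|")
    if kind != "loguniform":
        raise ValueError(f"expected a loguniform spec, got {spec!r}")
    low, high = (float(x) for x in bounds.split(","))
    return math.exp(low), math.exp(high)  # assumes natural-log bounds


def layer_count(spec):
    """'uniformint|3|8,512' -> 3: the middle field is the number of layers."""
    return int(spec.split("|")[1])


nn_params = {
    "lr": "loguniform|-13.8,-3",
    "ls": "uniformint|3|8,512",
    "dp": "uniform|3|0,0.4",
}

lr_low, lr_high = loguniform_bounds(nn_params["lr"])
print(f"learning rate range: {lr_low:.2e} to {lr_high:.2e}")

layers_match = layer_count(nn_params["ls"]) == layer_count(nn_params["dp"])
print("ls and dp layer counts match:", layers_match)
```

Under that assumption, the range `-13.8` to `-3` corresponds to learning rates from roughly 1e-6 to 0.05.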

#### XGBoost
- `xgbg` Stands for xgb_gamma and controls the minimum loss 
reduction required to make a further partition on a leaf node of the tree.
- `xgbl` Stands for xgb_learning_rate and sets the search domain for the
boosting learning rate of XGBoost models.

```
params = {
    "hyperparam": "True",
    "prediction_type": "regression",

    "dataset_key": dataset_key,
    "id_col": "compound_id",
    "smiles_col": "base_rdkit_smiles",
    "response_cols": "avg_pKi",

    "splitter":"scaffold",
    "split_uuid": split_uuid,
    "previously_split": "True",

    "featurizer": "computed_descriptors",
    "descriptor_type" : descriptor_type,
    "transformers": "True",

    ### Use an XGBoost model
    "search_type": "hyperopt",
    "model_type": "xgboost|10",
    "xgbg": "uniform|0,0.2",
    "xgbl": "loguniform|-2,2",
    ###

    "result_dir": f"./{best_model_dir},./{model_dir}"
}
```

#### Fingerprint Split
This trains an XGBoost model using a 
fingerprint split created in **tutorial 3**.

```
fp_split_uuid="be60c264-6ac0-4841-a6b6-41bf846e4ae4"

params = {
    "hyperparam": "True",
    "prediction_type": "regression",

    "dataset_key": dataset_key,
    "id_col": "compound_id",
    "smiles_col": "base_rdkit_smiles",
    "response_cols": "avg_pKi",

    ### Use a fingerprint split
    "splitter":"fingerprint",
    "split_uuid": fp_split_uuid,
    "previously_split": "True",
    ###

    "featurizer": "computed_descriptors",
    "descriptor_type" : descriptor_type,
    "transformers": "True",

    "search_type": "hyperopt",
    "model_type": "xgboost|10",
    "xgbg": "uniform|0,0.2",
    "xgbl": "loguniform|-2,2",

    "result_dir": f"./{best_model_dir},./{model_dir}"
}
```

#### ECFP Features
This uses an XGBoost model with ECFP features and a scaffold split.

```
params = {
    "hyperparam": "True",
    "prediction_type": "regression",

    "dataset_key": dataset_key,
    "id_col": "compound_id",
    "smiles_col": "base_rdkit_smiles",
    "response_cols": "avg_pKi",

    "splitter":"scaffold",
    "split_uuid": split_uuid,
    "previously_split": "True",

    ### Use ECFP Features
    "featurizer": "ecfp",
    "ecfp_radius" : 2,
    "ecfp_size" : 1024,
    "transformers": "True",
    ###

    "search_type": "hyperopt",
    "model_type": "xgboost|10",
    "xgbg": "uniform|0,0.2",
    "xgbl": "loguniform|-2,2",

    "result_dir": f"./{best_model_dir},./{model_dir}"
}
```

In **tutorial 7**, we analyze the performance of these large sets of models to select the best `hyperparameters` for `production models`.