# Hyperparameter Optimization
In this tutorial we demonstrate the following:
- Build a parameter dictionary to perform a hyperparameter optimization for a random forest using Bayesian optimization.
- Perform the optimization process.
- Review the results.

Hyperparameters control the training process and the architecture of the model itself. For example, the number of trees is a hyperparameter of a random forest. In contrast, a learned parameter of a random forest is the feature chosen at a single node (in a single tree) and the cutoff value that determines how the data is split at that node. A full discussion of hyperparameter optimization can be found on **[Wikipedia](https://en.wikipedia.org/wiki/Hyperparameter_optimization)**.
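
To make the distinction concrete, here is a minimal sketch that uses a single-split decision stump rather than a full random forest. The function `fit_stump` and its `min_samples_leaf` argument are hypothetical names chosen for illustration and are not part of AMPL. The leaf size is fixed before fitting (a hyperparameter), while the cutoff is derived from the data (a learned parameter).

```python
def fit_stump(xs, ys, min_samples_leaf=1):
    """Learn the cutoff on one feature that minimizes the squared error.

    min_samples_leaf is a hyperparameter: it is fixed before fitting.
    The returned cutoff is a learned parameter: it comes from the data.
    Assumes the x values are distinct.
    """
    if min_samples_leaf < 1:
        raise ValueError("min_samples_leaf must be at least 1")
    pairs = sorted(zip(xs, ys))
    if len(pairs) < 2 * min_samples_leaf:
        raise ValueError("not enough samples for the requested leaf size")

    def sse(values):
        # Sum of squared errors around the mean of one side of the split
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best_error, best_cutoff = None, None
    for i in range(min_samples_leaf, len(pairs) - min_samples_leaf + 1):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        error = sse(left) + sse(right)
        if best_error is None or error < best_error:
            best_error = error
            # Place the cutoff halfway between the two neighboring samples
            best_cutoff = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cutoff


xs = [1, 2, 3, 10, 11, 12]
ys = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0]
cutoff = fit_stump(xs, ys, min_samples_leaf=2)
print(cutoff)
```

Changing `min_samples_leaf` changes which cutoffs the fit is allowed to consider, which is why such settings are tuned in an outer search loop rather than learned during training.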

The choice of hyperparameters strongly influences model performance,
so it is important to optimize them as well. **[AMPL](https://github.com/ATOMScience-org/AMPL)**
offers several hyperparameter optimization methods, including
random sampling, grid search, and Bayesian optimization. Further information on AMPL's Bayesian optimization can be found **[here](https://github.com/ATOMScience-org/AMPL#hyperparameter-optimization)**.
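
As a rough illustration of how two of these strategies differ, the sketch below runs grid search and random sampling on a toy loss. Everything here is an assumption made for illustration: `toy_loss` is a made-up stand-in for a validation loss, and none of the functions are AMPL APIs. Bayesian optimization is not implemented here; it differs in that it uses the results of earlier trials to choose the next candidate.

```python
import itertools
import random


def toy_loss(n_estimators, max_depth):
    # Hypothetical stand-in for a validation loss; lowest near (200, 16)
    return ((n_estimators - 200) / 500) ** 2 + ((max_depth - 16) / 32) ** 2


def grid_search(estimator_grid, depth_grid):
    # Evaluate every combination of the listed values
    candidates = itertools.product(estimator_grid, depth_grid)
    return min(candidates, key=lambda p: toy_loss(*p))


def random_search(n_trials, seed=0):
    # Draw each hyperparameter independently from a uniform integer range
    rng = random.Random(seed)
    trials = [(rng.randint(8, 512), rng.randint(6, 32)) for _ in range(n_trials)]
    return min(trials, key=lambda p: toy_loss(*p))


best_grid = grid_search([8, 128, 256, 512], [6, 16, 32])
best_random = random_search(10)
print("grid search best:", best_grid)
print("random search best:", best_random)
```

Grid search can only return values that were listed up front, whereas random sampling can land anywhere in the range but gives no guarantee of coverage with a small number of trials.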

## Setup directories
Define key settings such as the descriptor type and the output directories. Make sure the directories exist before training the models.

In [4]:
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

import os

dataset_key='dataset/SLC6A3_Ki_curated.csv'
descriptor_type = 'rdkit_raw'
model_dir = 'dataset/SLC6A3_models'
best_model_dir = 'dataset/SLC6A3_models/best_models'
split_uuid = "c35aeaab-910c-4dcf-8f9f-04b55179aa1a"


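# Create the output directories if they do not exist yet.
# best_model_dir is nested inside model_dir, so use makedirs, which also
# creates missing parent directories (os.mkdir would fail here).
os.makedirs(f'./{best_model_dir}', exist_ok=True)
os.makedirs(f'./{model_dir}', exist_ok=True)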

## Parameter dictionary settings
- `'hyperparam':True` This setting indicates that we are performing
a hyperparameter search instead of training just one model.
- `'search_type':'hyperopt'` This specifies the hyperparameter
search method. Here we use the Bayesian optimization method. Other
options include grid, random, and geometric. The specification differs
for each search method, so please refer to the full documentation.
- `'model_type':'RF|10'` This means **[AMPL](https://github.com/ATOMScience-org/AMPL)** will run 10 trials to
find the best set of hyperparameters for a random forest. In
production this number could be set to 100 or more.
- `'rfe':'uniformint|8,512'` The Bayesian optimizer will search
integers drawn uniformly between 8 and 512 for the best number of random forest estimators.
Similarly, `rfd` stands for random forest depth and `rff` stands for
random forest features.
- `result_dir` now expects two directories, separated by a comma. The first directory
will contain the best trained models, while the second directory will
contain all models trained in the search.

A full list of parameters can be found on our GitHub **[here](https://github.com/ATOMScience-org/AMPL/blob/master/atomsci/ddm/docs/PARAMETERS.md)**.
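
The search strings above follow a compact `name|values` pattern. The sketch below shows one way to read them. The helpers `parse_search_spec` and `parse_model_type` are hypothetical and written only for illustration; they are not AMPL's own parser. Reading the middle field of a three-part string such as `uniformint|3|8,512` as a repeat count (for example, a number of layers) is an assumption.

```python
def parse_search_spec(spec):
    """Split a spec such as 'uniformint|8,512' or 'uniformint|3|8,512'.

    Hypothetical helper for illustration only; not AMPL's parser.
    """
    parts = spec.split("|")
    if len(parts) == 2:
        distribution, bounds = parts
        repeats = 1
    elif len(parts) == 3:
        # Assumption: the middle field is a repeat count
        distribution, count, bounds = parts
        repeats = int(count)
    else:
        raise ValueError(f"unexpected search spec: {spec!r}")
    low, high = (float(value) for value in bounds.split(","))
    return {"distribution": distribution, "repeats": repeats, "low": low, "high": high}


def parse_model_type(spec):
    """Split 'RF|10' into the model type and the number of trials."""
    model, trials = spec.split("|")
    return model, int(trials)


rfe = parse_search_spec("uniformint|8,512")
model, n_trials = parse_model_type("RF|10")
best_dir, all_dir = "./best_models,./all_models".split(",")  # made-up result_dir value
print(rfe, model, n_trials, best_dir, all_dir)
```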

In [5]:
params = {
    "hyperparam": "True",
    "prediction_type": "regression",

    "dataset_key": dataset_key,
    "id_col": "compound_id",
    "smiles_col": "base_rdkit_smiles",
    "response_cols": "avg_pKi",

    "splitter":"scaffold",
    "split_uuid": split_uuid,
    "previously_split": "True",

    "featurizer": "computed_descriptors",
    "descriptor_type" : descriptor_type,
    "transformers": "True",

    "search_type": "hyperopt",
    "model_type": "RF|10",
    "rfe": "uniformint|8,512",
    "rfd": "uniformint|6,32",
    "rff": "uniformint|8,200",

    "result_dir": f"./{best_model_dir},./{model_dir}"
}

In Tutorial 4 we directly imported the `parameter_parser` and `model_pipeline` objects to parse the `config` dict and train a single model. Here, we use `hyperparam_search_wrapper` to handle many models for us. First we build the search by creating a list of parameters to use, and then we run the search.

In [6]:
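import atomsci.ddm.utils.hyperparam_search_wrapper as hsw
import importlib

# Reloading is only needed if the module source changed during this session
importlib.reload(hsw)

# Parse the parameter dictionary, build the search object, and run all trials
ampl_param = hsw.parse_params(params)
hs = hsw.build_search(ampl_param)
hs.run_search()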

model_performance|train_r2|train_rms|valid_r2|valid_rms|test_r2|test_rms|model_params|model

rf_estimators: 170, rf_max_depth: 15, rf_max_feature: 119                                                                 
  0%|                                                                              | 0/10 [00:00<?, ?trial/s, best loss=?]

2024-02-28 18:01:08,366 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored


num_model_tasks is deprecated and its value is ignored.                                                                   
RF model with computed_descriptors and rdkit_raw                                                                          
  0%|                                                                              | 0/10 [00:00<?, ?trial/s, best loss=?]

2024-02-28 18:01:08,422 Previous dataset split restored
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)



model_performance|0.948|0.284|0.497|0.857|0.439|0.913|170_15_119|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_520295f8-2ac5-45d8-9433-d3a0e96ccf0c.tar.gz

rf_estimators: 485, rf_max_depth: 17, rf_max_feature: 123                                                                 
 10%|█████▏                                              | 1/10 [00:00<00:07,  1.23trial/s, best loss: 0.5029012627563357]

2024-02-28 18:01:09,178 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored


num_model_tasks is deprecated and its value is ignored.                                                                   
RF model with computed_descriptors and rdkit_raw                                                                          
 10%|█████▏                                              | 1/10 [00:00<00:07,  1.23trial/s, best loss: 0.5029012627563357]

2024-02-28 18:01:09,232 Previous dataset split restored
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)



model_performance|0.950|0.279|0.483|0.869|0.443|0.909|485_17_123|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_cc586cf1-9cbe-4f7c-bdff-af582c4649b9.tar.gz

rf_estimators: 143, rf_max_depth: 32, rf_max_feature: 185                                                                 
 20%|██████████▍                                         | 2/10 [00:02<00:12,  1.53s/trial, best loss: 0.5029012627563357]

2024-02-28 18:01:11,212 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored


num_model_tasks is deprecated and its value is ignored.                                                                   
RF model with computed_descriptors and rdkit_raw                                                                          
 20%|██████████▍                                         | 2/10 [00:02<00:12,  1.53s/trial, best loss: 0.5029012627563357]

2024-02-28 18:01:11,265 Previous dataset split restored
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)



model_performance|0.950|0.279|0.470|0.880|0.426|0.923|143_32_185|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_43ca0688-67e1-4abe-a5f3-32f0317876f8.tar.gz

rf_estimators: 421, rf_max_depth: 16, rf_max_feature: 41                                                                  
 30%|███████████████▌                                    | 3/10 [00:03<00:08,  1.20s/trial, best loss: 0.5029012627563357]

2024-02-28 18:01:12,012 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored


num_model_tasks is deprecated and its value is ignored.                                                                   
RF model with computed_descriptors and rdkit_raw                                                                          
 30%|███████████████▌                                    | 3/10 [00:03<00:08,  1.20s/trial, best loss: 0.5029012627563357]

2024-02-28 18:01:12,065 Previous dataset split restored
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)



model_performance|0.947|0.286|0.497|0.857|0.421|0.927|421_16_41|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_37a85d2a-1c0e-48aa-bcb9-f1b0a6107f29.tar.gz

rf_estimators: 382, rf_max_depth: 19, rf_max_feature: 92                                                                  
 40%|████████████████████▊                               | 4/10 [00:05<00:07,  1.33s/trial, best loss: 0.5029012627563357]

2024-02-28 18:01:13,549 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored


num_model_tasks is deprecated and its value is ignored.                                                                   
RF model with computed_descriptors and rdkit_raw                                                                          
 40%|████████████████████▊                               | 4/10 [00:05<00:07,  1.33s/trial, best loss: 0.5029012627563357]

2024-02-28 18:01:13,608 Previous dataset split restored
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)



model_performance|0.951|0.277|0.485|0.867|0.433|0.917|382_19_92|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_3eb607ee-cf68-4eac-bc0c-92689443a278.tar.gz

rf_estimators: 465, rf_max_depth: 17, rf_max_feature: 105                                                                 
 50%|██████████████████████████                          | 5/10 [00:06<00:07,  1.44s/trial, best loss: 0.5029012627563357]

2024-02-28 18:01:15,172 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored


num_model_tasks is deprecated and its value is ignored.                                                                   
RF model with computed_descriptors and rdkit_raw                                                                          
 50%|██████████████████████████                          | 5/10 [00:06<00:07,  1.44s/trial, best loss: 0.5029012627563357]

2024-02-28 18:01:15,226 Previous dataset split restored
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)



model_performance|0.950|0.279|0.484|0.868|0.435|0.915|465_17_105|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_726c1093-17e4-4e5e-9f8d-54eb5babec78.tar.gz

rf_estimators: 226, rf_max_depth: 12, rf_max_feature: 79                                                                  
 60%|███████████████████████████████▏                    | 6/10 [00:08<00:06,  1.63s/trial, best loss: 0.5029012627563357]

2024-02-28 18:01:17,171 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored


num_model_tasks is deprecated and its value is ignored.                                                                   
RF model with computed_descriptors and rdkit_raw                                                                          
 60%|███████████████████████████████▏                    | 6/10 [00:08<00:06,  1.63s/trial, best loss: 0.5029012627563357]

2024-02-28 18:01:17,223 Previous dataset split restored
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)



model_performance|0.939|0.309|0.478|0.873|0.427|0.922|226_12_79|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_00c797ec-e41d-4a5f-b74d-cf41ffc5fc9b.tar.gz

rf_estimators: 87, rf_max_depth: 32, rf_max_feature: 195                                                                  
 70%|████████████████████████████████████▍               | 7/10 [00:09<00:04,  1.37s/trial, best loss: 0.5029012627563357]

2024-02-28 18:01:18,023 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored


num_model_tasks is deprecated and its value is ignored.                                                                   
RF model with computed_descriptors and rdkit_raw                                                                          
 70%|████████████████████████████████████▍               | 7/10 [00:09<00:04,  1.37s/trial, best loss: 0.5029012627563357]

2024-02-28 18:01:18,072 Previous dataset split restored
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)



model_performance|0.948|0.283|0.470|0.880|0.446|0.907|87_32_195|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_77f14ea8-684e-405f-bd0c-ed584d94250f.tar.gz

rf_estimators: 86, rf_max_depth: 26, rf_max_feature: 158                                                                  
 80%|█████████████████████████████████████████▌          | 8/10 [00:10<00:02,  1.11s/trial, best loss: 0.5029012627563357]

2024-02-28 18:01:18,574 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored


num_model_tasks is deprecated and its value is ignored.                                                                   
RF model with computed_descriptors and rdkit_raw                                                                          
 80%|█████████████████████████████████████████▌          | 8/10 [00:10<00:02,  1.11s/trial, best loss: 0.5029012627563357]

2024-02-28 18:01:18,626 Previous dataset split restored
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)



model_performance|0.949|0.280|0.481|0.871|0.434|0.917|86_26_158|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_fe5b0e3c-f7eb-4887-a1c2-bf1a11a12ef9.tar.gz

rf_estimators: 500, rf_max_depth: 14, rf_max_feature: 125                                                                 
 90%|██████████████████████████████████████████████▊     | 9/10 [00:10<00:00,  1.09trial/s, best loss: 0.5029012627563357]

2024-02-28 18:01:19,073 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored


num_model_tasks is deprecated and its value is ignored.                                                                   
RF model with computed_descriptors and rdkit_raw                                                                          
 90%|██████████████████████████████████████████████▊     | 9/10 [00:10<00:00,  1.09trial/s, best loss: 0.5029012627563357]

2024-02-28 18:01:19,125 Previous dataset split restored
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)



model_performance|0.947|0.288|0.490|0.863|0.437|0.914|500_14_125|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_d3388e81-d151-420d-b55c-b10627d1c71e.tar.gz

100%|███████████████████████████████████████████████████| 10/10 [00:12<00:00,  1.27s/trial, best loss: 0.5029012627563357]
Generating the performance -- iteration table and Copy the best model tarball.
Best model: ./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_520295f8-2ac5-45d8-9433-d3a0e96ccf0c.tar.gz, valid R2: 0.49709873724366427
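
The `model_performance` lines in the output above are pipe-delimited and share the header printed at the start of the run, so they can be parsed directly. The sketch below is a hypothetical helper, not an AMPL function, applied to three lines copied from the log. It also illustrates that the reported `best loss` (0.5029...) appears to equal one minus the best validation R2 (0.4971...), which is an observation from this log rather than documented behavior.

```python
HEADER = "model_performance|train_r2|train_rms|valid_r2|valid_rms|test_r2|test_rms|model_params|model"
NUMERIC_FIELDS = ("train_r2", "train_rms", "valid_r2", "valid_rms", "test_r2", "test_rms")

# Three lines copied from the search output above
LOG_LINES = [
    "model_performance|0.948|0.284|0.497|0.857|0.439|0.913|170_15_119|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_520295f8-2ac5-45d8-9433-d3a0e96ccf0c.tar.gz",
    "model_performance|0.950|0.279|0.483|0.869|0.443|0.909|485_17_123|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_cc586cf1-9cbe-4f7c-bdff-af582c4649b9.tar.gz",
    "model_performance|0.950|0.279|0.470|0.880|0.426|0.923|143_32_185|./dataset/SLC6A3_models/SLC6A3_Ki_curated_model_43ca0688-67e1-4abe-a5f3-32f0317876f8.tar.gz",
]


def parse_performance_line(line, header=HEADER):
    """Turn one 'model_performance|...' line into a dict keyed by the header names."""
    names = header.split("|")[1:]
    values = line.strip().split("|")[1:]
    if len(values) != len(names):
        raise ValueError("line does not match the header")
    record = dict(zip(names, values))
    for name in NUMERIC_FIELDS:
        record[name] = float(record[name])
    return record


records = [parse_performance_line(line) for line in LOG_LINES]
best = max(records, key=lambda record: record["valid_r2"])

# model_params is printed as rf_estimators_rf_max_depth_rf_max_feature
estimators, depth, features = (int(v) for v in best["model_params"].split("_"))
loss = 1 - best["valid_r2"]
print(estimators, depth, features, round(loss, 3))
```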


The top-scoring model will be saved in `dataset/SLC6A3_models/best_models` along with a CSV file
containing regression performance for all trained models.

All of the models are saved in `dataset/SLC6A3_models`. These models can be
explored using `get_filesystem_perf_results`. A full analysis of the hyperparameter performance is presented in Tutorial 7.

In [8]:
import atomsci.ddm.pipeline.compare_models as cm

result_df = cm.get_filesystem_perf_results(
    result_dir=model_dir,
    pred_type='regression'
)

# sort by validation r2 score to see top performing models
result_df = result_df.sort_values(by='best_valid_r2_score', ascending=False)
result_df[['model_uuid','model_parameters_dict','best_valid_r2_score','best_test_r2_score']].head()

Found data for 12 models under dataset/SLC6A3_models


| | model_uuid | model_parameters_dict | best_valid_r2_score | best_test_r2_score |
|---|---|---|---|---|
| 2 | 520295f8-2ac5-45d8-9433-d3a0e96ccf0c | {"rf_estimators": 170, "rf_max_depth": 15, "rf... | 0.497099 | 0.43862 |
| 5 | 37a85d2a-1c0e-48aa-bcb9-f1b0a6107f29 | {"rf_estimators": 421, "rf_max_depth": 16, "rf... | 0.496579 | 0.42124 |
| 11 | d3388e81-d151-420d-b55c-b10627d1c71e | {"rf_estimators": 500, "rf_max_depth": 14, "rf... | 0.489879 | 0.437214 |
| 0 | 8afb64d6-993e-4d8b-9072-60dcb40d2c83 | {"rf_estimators": 500, "rf_max_depth": null, "... | 0.489673 | 0.416391 |
| 6 | 3eb607ee-cf68-4eac-bc0c-92689443a278 | {"rf_estimators": 382, "rf_max_depth": 19, "rf... | 0.484862 | 0.433234 |


### Examples for other parameters
Below are some parameters that can be used for neural networks, XGBoost models, fingerprint splits, and ECFP features.

In [9]:
# NN models
nn_params = {
    "search_type": "hyperopt",
    "model_type": "NN|10",
    "lr": "loguniform|-13.8,-3",
    "ls": "uniformint|3|8,512",
    "dp": "uniform|3|0,0.4",
    "max_epochs":100
}

# params.update(nn_params)

# ampl_param = hsw.parse_params(params)
# hs = hsw.build_search(ampl_param)
# hs.run_search()
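
The bounds in a `loguniform` specification are easy to misread. Assuming they are natural-log exponents, as in hyperopt's `hp.loguniform`, the string `loguniform|-13.8,-3` corresponds to learning rates between roughly 1e-6 and 0.05. The helper `loguniform_bounds` below is hypothetical and only performs this arithmetic.

```python
import math


def loguniform_bounds(spec):
    """Return the (low, high) values implied by a 'loguniform|a,b' spec.

    Assumption: a and b are natural-log exponents, so the sampled value
    lies between exp(a) and exp(b).
    """
    distribution, bounds = spec.split("|")
    if distribution != "loguniform":
        raise ValueError(f"expected a loguniform spec, got {spec!r}")
    low_exp, high_exp = (float(value) for value in bounds.split(","))
    return math.exp(low_exp), math.exp(high_exp)


lr_low, lr_high = loguniform_bounds("loguniform|-13.8,-3")
print(f"learning rate range: {lr_low:.2e} to {lr_high:.3f}")
```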

In [10]:
# xgboost models
xg_params = {
    "search_type": "hyperopt",
    "model_type": "xgboost|10",
    "xgbg": "uniform|0,0.2",
    "xgbl": "loguniform|-2,2",
}

In [11]:
# fingerprint split models - use Tutorial 3 to create a fingerprint split
fp_split_uuid="be60c264-6ac0-4841-a6b6-41bf846e4ae4"

fp_params = {
    "splitter":"fingerprint",
    "split_uuid": fp_split_uuid,
    "previously_split": "True",
}

In [12]:
# Morgan fingerprint features
ecfp_params = {
    "featurizer": "ecfp",
    "ecfp_radius" : 2,
    "ecfp_size" : 1024,
    "transformers": "True",
}
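
One way to combine these fragments with a base dictionary without modifying it in place is sketched below. The helper `with_overrides` and the shortened `base_params` are assumptions for illustration. The sketch does not check whether AMPL ignores leftover keys, such as random forest ranges, when the model type changes.

```python
def with_overrides(base, *overrides):
    """Return a new dict with each override applied in order; base is left untouched."""
    merged = dict(base)
    for override in overrides:
        merged.update(override)
    return merged


# Shortened, hypothetical base dictionary
base_params = {
    "search_type": "hyperopt",
    "model_type": "RF|10",
    "splitter": "scaffold",
    "featurizer": "computed_descriptors",
}
xg_overrides = {"model_type": "xgboost|10", "xgbg": "uniform|0,0.2", "xgbl": "loguniform|-2,2"}
ecfp_overrides = {"featurizer": "ecfp", "ecfp_radius": 2, "ecfp_size": 1024}

xg_ecfp_params = with_overrides(base_params, xg_overrides, ecfp_overrides)
print(xg_ecfp_params["model_type"], xg_ecfp_params["featurizer"])
```

Keeping the base dictionary unchanged makes it easier to launch several searches from the same starting point than calling `params.update(...)` repeatedly.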

In Tutorial 7, we analyze the performance of these large sets of models to select the best hyperparameters for production models.