# Hyperparameter Optimization
Hyperparameters are settings of the model or the learning process
that are chosen before training rather than learned from the data.
For example, the number of trees is a hyperparameter of a random forest.

The choice of hyperparameters strongly influences model performance,
so it is important to optimize them as well. AMPL offers a variety
of hyperparameter optimization methods, including random sampling,
grid search, and Bayesian optimization. Here we demonstrate a
hyperparameter search using Bayesian optimization.
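
To make the difference between two of these strategies concrete, here is a minimal pure-Python sketch that builds candidate settings by grid search and by random sampling. It does not use AMPL; the helper names `grid_candidates` and `random_candidates` and the choice of three grid points per axis are assumptions made for illustration, and only the integer ranges are taken from this tutorial's settings.

```python
import itertools
import random

# Integer ranges borrowed from this tutorial's settings; everything else
# in this sketch is illustrative and not part of AMPL.
space = {
    "rfe": (8, 512),  # number of estimators
    "rfd": (8, 512),  # maximum depth
    "rff": (8, 200),  # maximum features
}


def grid_candidates(space, points_per_axis=3):
    """Evenly spaced integer grid over every range (needs >= 2 points per axis)."""
    axes = []
    for low, high in space.values():
        step = (high - low) / (points_per_axis - 1)
        axes.append([round(low + i * step) for i in range(points_per_axis)])
    # Cartesian product of the per-parameter axes
    return [dict(zip(space, combo)) for combo in itertools.product(*axes)]


def random_candidates(space, n_trials, seed=0):
    """Draw each parameter independently and uniformly from its range."""
    rng = random.Random(seed)
    return [
        {name: rng.randint(low, high) for name, (low, high) in space.items()}
        for _ in range(n_trials)
    ]


grid = grid_candidates(space)
rand = random_candidates(space, n_trials=10)
print(len(grid), "grid candidates;", len(rand), "random candidates")
```

With three points per axis the grid already contains 27 candidates, and the count grows exponentially with the number of hyperparameters, whereas random sampling uses a fixed trial budget. Bayesian optimization also uses a fixed budget, but chooses each new candidate based on the results of earlier trials.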

## JSON Settings
- `'hyperparam':True` This setting indicates that we are performing
a hyperparameter search instead of just training one model.
- `'search_type':'hyperopt'` This specifies the hyperparameter
search method. Other options include `grid`, `random`, and `geometric`.
The settings required differ for each search method; please refer
to the full documentation.
- `'model_type':'RF|10'` This means AMPL will run 10 trials to
find the best set of hyperparameters for a random forest. In
production, the number of trials could be set to 100 or more.
- `'rfe':'uniformint|8,512'` The Bayesian optimizer will sample integers
uniformly between 8 and 512 to find the best number of random forest
estimators. Similarly, `rfd` stands for random forest depth and `rff`
stands for random forest features.
- `result_dir` This setting now expects two directories separated by a
comma. The first directory will contain the best trained model, while
the second directory will contain all models trained in the search.
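
The `name|values` strings above pack a keyword and its arguments into one setting. The sketch below shows how such strings can be read; it is not AMPL's own parser. The helpers `parse_model_type`, `parse_range`, and `sample` are hypothetical, and treating both bounds as inclusive is an assumption.

```python
import random


def parse_model_type(spec):
    """Split a string such as 'RF|10' into (model type, number of trials)."""
    model, n_trials = spec.split("|")
    return model, int(n_trials)


def parse_range(spec):
    """Split a string such as 'uniformint|8,512' into (distribution, low, high)."""
    dist, bounds = spec.split("|")
    low, high = (int(b) for b in bounds.split(","))
    return dist, low, high


def sample(spec, rng):
    """Draw one value for a 'uniformint' setting (hypothetical helper)."""
    dist, low, high = parse_range(spec)
    if dist != "uniformint":
        raise ValueError(f"this sketch only handles uniformint, got {dist!r}")
    # Assumption: both bounds are inclusive
    return rng.randint(low, high)


rng = random.Random(0)
print(parse_model_type("RF|10"))
print(parse_range("uniformint|8,512"))
print(sample("uniformint|8,200", rng))
```

Note that the upper bound of 200 used for `rff` in the configuration below matches the 200 features that the `rdkit_raw` descriptor set reports in the log output.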

In [1]:
import atomsci.ddm.utils.hyperparam_search_wrapper as hsw
import os

descriptor_type = 'rdkit_raw'
output_dir = 'output_kcna3_rdkit_raw'
tmp_dir = 'tmp_kcna3_rdkit_raw'
config = {
    "hyperparam": "True",
    "prediction_type": "regression",

    "dataset_key": "dataset/curated_kcna3_ic50.csv",
    "id_col": "compound_id",
    "smiles_col": "base_rdkit_smiles",
    "response_cols": "avg_pIC50",

    "splitter":"scaffold",
    "split_uuid": "c0313c63-8936-4297-925b-ee537b66dd89",
    "previously_split": "True",

    "featurizer": "computed_descriptors",
    "descriptor_type" : descriptor_type,
    "transformers": "True",

    "search_type": "hyperopt",
    "model_type": "RF|10",
    "rfe": "uniformint|8,512",
    "rfd": "uniformint|8,512",
    "rff": "uniformint|8,200",

    "result_dir": f"./{output_dir},./{tmp_dir}"
}


# Create the output directory if it does not already exist
os.makedirs(f'./{output_dir}', exist_ok=True)

params = hsw.parse_params(config)
hs = hsw.build_search(params)
hs.run_search()


Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'torch_geometric'
Skipped loading modules with pytorch-geometric dependency, missing a dependency. cannot import name 'DMPNN' from 'deepchem.models.torch_models' (/usr/WS1/he6/AMPL_virtualenv_1.6/lib/python3.9/site-packages/deepchem/models/torch_models/__init__.py)
Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'pytorch_lightning'
Skipped loading some Jax models, missing a dependency. jax requires jaxlib to be installed. See https://github.com/google/jax#installation for installation instructions.
2024-01-04 15:29:28,218 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


model_performance|train_r2|train_rms|valid_r2|valid_rms|test_r2|test_rms|model_params|model

rf_estimators: 153, rf_max_depth: 12, rf_max_feature: 136
  0%|          | 0/10 [00:00<?, ?trial/s, best loss=?]

2024-01-04 15:29:28,307 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-04 15:29:28,333 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.
RF model with computed_descriptors and rdkit_raw      
  0%|          | 0/10 [00:00<?, ?trial/s, best loss=?]

2024-01-04 15:29:28,389 Reading descriptor spec table from /usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/data/descriptor_sets_sources_by_descr_type.csv
2024-01-04 15:29:28,400 Attempting to load featurized dataset
2024-01-04 15:29:28,440 Got dataset, attempting to extract data
2024-01-04 15:29:28,507 Creating deepchem dataset
2024-01-04 15:29:28,508 Using prefeaturized data; number of features = 200
2024-01-04 15:29:28,523 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-04 15:29:28,539 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/e594fbaa-99b5-401b-a995-fcce5661c3c5/transformers.pkl
2024-01-04 15:29:28,541 Transforming response data
2024-01-04 15:29:28,541 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-04 15:29:28,568 Transforming response data
2024-01-04 15:29:28,569 Transforming feature data
2024-01-04 15:29:28,571 Transforming response data
2024-01-04

model_performance|0.975|0.289|0.717|0.762|0.821|0.719|153_12_136|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_e594fbaa-99b5-401b-a995-fcce5661c3c5.tar.gz

rf_estimators: 447, rf_max_depth: 229, rf_max_feature: 67                        
 10%|█         | 1/10 [00:01<00:08,  1.00trial/s, best loss: 0.28341222494691065]

2024-01-04 15:29:29,304 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-04 15:29:29,332 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.                          
RF model with computed_descriptors and rdkit_raw                                 
 10%|█         | 1/10 [00:01<00:08,  1.00trial/s, best loss: 0.28341222494691065]

2024-01-04 15:29:29,366 Attempting to load featurized dataset
2024-01-04 15:29:29,419 Got dataset, attempting to extract data
2024-01-04 15:29:29,491 Creating deepchem dataset
2024-01-04 15:29:29,493 Using prefeaturized data; number of features = 200
2024-01-04 15:29:29,509 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-04 15:29:29,523 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/3f8a51d9-034a-4934-9dea-5e8657aa28c7/transformers.pkl
2024-01-04 15:29:29,524 Transforming response data
2024-01-04 15:29:29,525 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-04 15:29:29,528 Transforming response data
2024-01-04 15:29:29,528 Transforming feature data
2024-01-04 15:29:29,530 Transforming response data
2024-01-04 15:29:29,530 Transforming feature data
2024-01-04 15:29:29,532 Fitting random forest model
2024-01-04 15:29:30,703 Fold 0: training r2_score = 0.976

model_performance|0.976|0.284|0.718|0.760|0.820|0.720|447_229_67|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_3f8a51d9-034a-4934-9dea-5e8657aa28c7.tar.gz

rf_estimators: 133, rf_max_depth: 368, rf_max_feature: 199                       
 20%|██        | 2/10 [00:03<00:13,  1.64s/trial, best loss: 0.28214858719782177]

2024-01-04 15:29:31,398 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-04 15:29:31,426 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.                          
RF model with computed_descriptors and rdkit_raw                                 
 20%|██        | 2/10 [00:03<00:13,  1.64s/trial, best loss: 0.28214858719782177]

2024-01-04 15:29:31,453 Attempting to load featurized dataset
2024-01-04 15:29:31,516 Got dataset, attempting to extract data
2024-01-04 15:29:31,590 Creating deepchem dataset
2024-01-04 15:29:31,591 Using prefeaturized data; number of features = 200
2024-01-04 15:29:31,610 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-04 15:29:31,624 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/26ad8ca5-6385-4815-8b97-a2d9a02160fb/transformers.pkl
2024-01-04 15:29:31,625 Transforming response data
2024-01-04 15:29:31,626 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-04 15:29:31,629 Transforming response data
2024-01-04 15:29:31,630 Transforming feature data
2024-01-04 15:29:31,631 Transforming response data
2024-01-04 15:29:31,632 Transforming feature data
2024-01-04 15:29:31,633 Fitting random forest model
2024-01-04 15:29:32,051 Fold 0: training r2_score = 0.976

model_performance|0.976|0.284|0.722|0.754|0.815|0.731|133_368_199|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_26ad8ca5-6385-4815-8b97-a2d9a02160fb.tar.gz

rf_estimators: 315, rf_max_depth: 94, rf_max_feature: 60                         
 30%|███       | 3/10 [00:04<00:09,  1.31s/trial, best loss: 0.2775742220882873]

2024-01-04 15:29:32,322 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-04 15:29:32,348 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.                         
RF model with computed_descriptors and rdkit_raw                                
 30%|███       | 3/10 [00:04<00:09,  1.31s/trial, best loss: 0.2775742220882873]

2024-01-04 15:29:32,374 Attempting to load featurized dataset
2024-01-04 15:29:32,421 Got dataset, attempting to extract data
2024-01-04 15:29:32,489 Creating deepchem dataset
2024-01-04 15:29:32,490 Using prefeaturized data; number of features = 200
2024-01-04 15:29:32,507 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-04 15:29:32,522 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/4d010652-aae1-402f-8166-d201b56f2ebf/transformers.pkl
2024-01-04 15:29:32,523 Transforming response data
2024-01-04 15:29:32,523 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-04 15:29:32,526 Transforming response data
2024-01-04 15:29:32,527 Transforming feature data
2024-01-04 15:29:32,528 Transforming response data
2024-01-04 15:29:32,529 Transforming feature data
2024-01-04 15:29:32,530 Fitting random forest model
2024-01-04 15:29:33,359 Fold 0: training r2_score = 0.976

model_performance|0.976|0.285|0.724|0.752|0.822|0.715|315_94_60|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_4d010652-aae1-402f-8166-d201b56f2ebf.tar.gz

rf_estimators: 432, rf_max_depth: 157, rf_max_feature: 47                        
 40%|████      | 4/10 [00:05<00:08,  1.41s/trial, best loss: 0.27628688389578426]

2024-01-04 15:29:33,872 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-04 15:29:33,903 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.                          
RF model with computed_descriptors and rdkit_raw                                 
 40%|████      | 4/10 [00:05<00:08,  1.41s/trial, best loss: 0.27628688389578426]

2024-01-04 15:29:33,929 Attempting to load featurized dataset
2024-01-04 15:29:33,974 Got dataset, attempting to extract data
2024-01-04 15:29:34,043 Creating deepchem dataset
2024-01-04 15:29:34,045 Using prefeaturized data; number of features = 200
2024-01-04 15:29:34,062 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-04 15:29:34,077 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/94ff98fa-e268-45c6-9a97-e2eefaca6104/transformers.pkl
2024-01-04 15:29:34,078 Transforming response data
2024-01-04 15:29:34,079 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-04 15:29:34,082 Transforming response data
2024-01-04 15:29:34,082 Transforming feature data
2024-01-04 15:29:34,084 Transforming response data
2024-01-04 15:29:34,084 Transforming feature data
2024-01-04 15:29:34,086 Fitting random forest model
2024-01-04 15:29:35,213 Fold 0: training r2_score = 0.976

model_performance|0.976|0.286|0.722|0.754|0.826|0.708|432_157_47|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_94ff98fa-e268-45c6-9a97-e2eefaca6104.tar.gz

rf_estimators: 499, rf_max_depth: 346, rf_max_feature: 118                       
 50%|█████     | 5/10 [00:07<00:08,  1.62s/trial, best loss: 0.27628688389578426]

2024-01-04 15:29:35,878 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-04 15:29:35,926 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.                          
RF model with computed_descriptors and rdkit_raw                                 
 50%|█████     | 5/10 [00:07<00:08,  1.62s/trial, best loss: 0.27628688389578426]

2024-01-04 15:29:35,994 Attempting to load featurized dataset
2024-01-04 15:29:36,040 Got dataset, attempting to extract data
2024-01-04 15:29:36,106 Creating deepchem dataset
2024-01-04 15:29:36,107 Using prefeaturized data; number of features = 200
2024-01-04 15:29:36,123 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-04 15:29:36,140 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/ab1fc0f8-7ac2-495d-91c7-76107cd75cc5/transformers.pkl
2024-01-04 15:29:36,141 Transforming response data
2024-01-04 15:29:36,142 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-04 15:29:36,144 Transforming response data
2024-01-04 15:29:36,145 Transforming feature data
2024-01-04 15:29:36,147 Transforming response data
2024-01-04 15:29:36,147 Transforming feature data
2024-01-04 15:29:36,149 Fitting random forest model
2024-01-04 15:29:37,463 Fold 0: training r2_score = 0.976

model_performance|0.976|0.282|0.720|0.757|0.816|0.727|499_346_118|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_ab1fc0f8-7ac2-495d-91c7-76107cd75cc5.tar.gz

rf_estimators: 242, rf_max_depth: 289, rf_max_feature: 161                       
 60%|██████    | 6/10 [00:10<00:07,  1.90s/trial, best loss: 0.27628688389578426]

2024-01-04 15:29:38,326 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-04 15:29:38,352 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.                          
RF model with computed_descriptors and rdkit_raw                                 
 60%|██████    | 6/10 [00:10<00:07,  1.90s/trial, best loss: 0.27628688389578426]

2024-01-04 15:29:38,400 Attempting to load featurized dataset
2024-01-04 15:29:38,451 Got dataset, attempting to extract data
2024-01-04 15:29:38,518 Creating deepchem dataset
2024-01-04 15:29:38,520 Using prefeaturized data; number of features = 200
2024-01-04 15:29:38,536 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-04 15:29:38,570 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/09e8c09e-186e-4f4c-9f72-e3960e6145fb/transformers.pkl
2024-01-04 15:29:38,571 Transforming response data
2024-01-04 15:29:38,572 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-04 15:29:38,575 Transforming response data
2024-01-04 15:29:38,576 Transforming feature data
2024-01-04 15:29:38,577 Transforming response data
2024-01-04 15:29:38,578 Transforming feature data
2024-01-04 15:29:38,579 Fitting random forest model
2024-01-04 15:29:39,274 Fold 0: training r2_score = 0.975

model_performance|0.975|0.286|0.715|0.764|0.817|0.727|242_289_161|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_09e8c09e-186e-4f4c-9f72-e3960e6145fb.tar.gz

rf_estimators: 160, rf_max_depth: 247, rf_max_feature: 103                       
 70%|███████   | 7/10 [00:11<00:05,  1.74s/trial, best loss: 0.27628688389578426]

2024-01-04 15:29:39,727 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-04 15:29:39,754 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.                          
RF model with computed_descriptors and rdkit_raw                                 
 70%|███████   | 7/10 [00:11<00:05,  1.74s/trial, best loss: 0.27628688389578426]

2024-01-04 15:29:39,786 Attempting to load featurized dataset
2024-01-04 15:29:39,843 Got dataset, attempting to extract data
2024-01-04 15:29:39,913 Creating deepchem dataset
2024-01-04 15:29:39,915 Using prefeaturized data; number of features = 200
2024-01-04 15:29:39,936 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-04 15:29:39,950 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/c165d4d6-e0be-4be9-a6aa-301539c26fbb/transformers.pkl
2024-01-04 15:29:39,951 Transforming response data
2024-01-04 15:29:39,952 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-04 15:29:39,954 Transforming response data
2024-01-04 15:29:39,955 Transforming feature data
2024-01-04 15:29:39,957 Transforming response data
2024-01-04 15:29:39,957 Transforming feature data
2024-01-04 15:29:39,959 Fitting random forest model
2024-01-04 15:29:40,423 Fold 0: training r2_score = 0.975

model_performance|0.975|0.290|0.726|0.749|0.820|0.721|160_247_103|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_c165d4d6-e0be-4be9-a6aa-301539c26fbb.tar.gz

rf_estimators: 332, rf_max_depth: 173, rf_max_feature: 175                       
 80%|████████  | 8/10 [00:12<00:03,  1.54s/trial, best loss: 0.27398817546977505]

2024-01-04 15:29:40,831 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-04 15:29:40,861 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.                          
RF model with computed_descriptors and rdkit_raw                                 
 80%|████████  | 8/10 [00:12<00:03,  1.54s/trial, best loss: 0.27398817546977505]

2024-01-04 15:29:40,890 Attempting to load featurized dataset
2024-01-04 15:29:40,950 Got dataset, attempting to extract data
2024-01-04 15:29:41,020 Creating deepchem dataset
2024-01-04 15:29:41,021 Using prefeaturized data; number of features = 200
2024-01-04 15:29:41,038 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-04 15:29:41,052 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/a8364b53-2d19-452e-ae6f-97c917704c1b/transformers.pkl
2024-01-04 15:29:41,053 Transforming response data
2024-01-04 15:29:41,054 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-04 15:29:41,057 Transforming response data
2024-01-04 15:29:41,058 Transforming feature data
2024-01-04 15:29:41,059 Transforming response data
2024-01-04 15:29:41,060 Transforming feature data
2024-01-04 15:29:41,061 Fitting random forest model
2024-01-04 15:29:41,995 Fold 0: training r2_score = 0.976

model_performance|0.976|0.283|0.722|0.754|0.812|0.736|332_173_175|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_a8364b53-2d19-452e-ae6f-97c917704c1b.tar.gz

rf_estimators: 460, rf_max_depth: 61, rf_max_feature: 158                        
 90%|█████████ | 9/10 [00:14<00:01,  1.59s/trial, best loss: 0.27398817546977505]

2024-01-04 15:29:42,546 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-04 15:29:42,570 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.                          
RF model with computed_descriptors and rdkit_raw                                 
 90%|█████████ | 9/10 [00:14<00:01,  1.59s/trial, best loss: 0.27398817546977505]

2024-01-04 15:29:42,602 Attempting to load featurized dataset
2024-01-04 15:29:42,651 Got dataset, attempting to extract data
2024-01-04 15:29:42,720 Creating deepchem dataset
2024-01-04 15:29:42,721 Using prefeaturized data; number of features = 200
2024-01-04 15:29:42,741 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-04 15:29:42,755 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/ea14600a-90bc-4fbb-88b1-af8fc2204344/transformers.pkl
2024-01-04 15:29:42,756 Transforming response data
2024-01-04 15:29:42,757 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-04 15:29:42,760 Transforming response data
2024-01-04 15:29:42,760 Transforming feature data
2024-01-04 15:29:42,762 Transforming response data
2024-01-04 15:29:42,763 Transforming feature data
2024-01-04 15:29:42,764 Fitting random forest model
2024-01-04 15:29:44,033 Fold 0: training r2_score = 0.976

model_performance|0.976|0.282|0.725|0.750|0.817|0.726|460_61_158|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_ea14600a-90bc-4fbb-88b1-af8fc2204344.tar.gz

100%|██████████| 10/10 [00:16<00:00,  1.64s/trial, best loss: 0.27398817546977505]
Generating the performance -- iteration table and Copy the best model tarball.
Best model: ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_c165d4d6-e0be-4be9-a6aa-301539c26fbb.tar.gz, valid R2: 0.726011824530225


The best model is saved in `output_kcna3_rdkit_raw`, along with a CSV file
containing regression performance for all trained models.
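
The `model_performance|...` lines in the output above already form a small pipe-delimited table, so the trials can be compared without opening any files. The following sketch parses three of the ten result lines, copied verbatim from the log, and picks the trial with the highest validation R2. The function `parse_performance` is a hypothetical helper, not part of AMPL.

```python
# Header line plus three of the ten result lines from the log above
LOG = """\
model_performance|train_r2|train_rms|valid_r2|valid_rms|test_r2|test_rms|model_params|model
model_performance|0.975|0.289|0.717|0.762|0.821|0.719|153_12_136|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_e594fbaa-99b5-401b-a995-fcce5661c3c5.tar.gz
model_performance|0.975|0.290|0.726|0.749|0.820|0.721|160_247_103|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_c165d4d6-e0be-4be9-a6aa-301539c26fbb.tar.gz
model_performance|0.976|0.282|0.725|0.750|0.817|0.726|460_61_158|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_ea14600a-90bc-4fbb-88b1-af8fc2204344.tar.gz
"""


def parse_performance(log_text):
    """Turn 'model_performance|...' lines into a list of dicts (hypothetical helper)."""
    rows = [
        line.split("|")[1:]
        for line in log_text.splitlines()
        if line.startswith("model_performance|")
    ]
    columns, records = rows[0], rows[1:]
    text_columns = {"model_params", "model"}
    return [
        {c: (v if c in text_columns else float(v)) for c, v in zip(columns, rec)}
        for rec in records
    ]


trials = parse_performance(LOG)
best = max(trials, key=lambda t: t["valid_r2"])
print(best["model_params"], best["valid_r2"], round(1 - best["valid_r2"], 3))
```

In this run, the `best loss` shown in the progress bar (0.27398817546977505) equals one minus the reported best validation R2 (0.726011824530225), which suggests that the search minimizes `1 - valid_r2`. This is an observation from the log rather than documented behavior.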

All models trained during the search are saved in `tmp_kcna3_rdkit_raw`.
These models can be explored using `get_filesystem_perf_results`.

In [3]:
import atomsci.ddm.pipeline.compare_models as cm

result_df = cm.get_filesystem_perf_results(
    result_dir='tmp_kcna3_rdkit_raw',
    pred_type='regression'
)

# sort by validation r2 score to find the best model
result_df = result_df.sort_values(by='best_valid_r2_score', ascending=False)



Found data for 10 models under tmp_kcna3_rdkit_raw


The column `model_parameters_dict` contains the hyperparameters used for each model.
Because `result_df` is sorted by validation R2, the first row corresponds to the best model.

In [4]:
result_df.iloc[0].model_parameters_dict

'{"rf_estimators": 160, "rf_max_depth": 247, "rf_max_features": 103}'
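
The value shown above is a JSON string rather than a dictionary, so it needs to be decoded before individual hyperparameters can be accessed. A minimal sketch using the standard library, with the string copied from the output above:

```python
import json

# String printed by the cell above
raw = '{"rf_estimators": 160, "rf_max_depth": 247, "rf_max_features": 103}'

best_params = json.loads(raw)
for name, value in best_params.items():
    print(f"{name} = {value}")
```

These values match the `160_247_103` entry in the `model_params` column of the search log.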