# Hyperparameter Optimization
Hyperparameters are settings of the model or the learning process
that are chosen before training rather than learned from the data.
For example, the number of trees is a hyperparameter of a random forest,
whereas the feature chosen at each tree node and its split point
are parameters learned during training.
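
To make the distinction concrete, here is a minimal pure-Python sketch. It is not AMPL code: the helper names `fit_stump` and `fit_forest` and the toy data are assumptions made for illustration. `n_trees` is a hyperparameter fixed before training, while each stump's split point is a parameter learned from the data.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree; the split point is a learned parameter."""
    overall_mean = sum(ys) / len(ys)
    # Fallback used when no split is possible (all x values identical).
    best = (float("inf"), xs[0], overall_mean, overall_mean)
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        sse = (sum((y - mean_left) ** 2 for y in left)
               + sum((y - mean_right) ** 2 for y in right))
        if sse < best[0]:
            best = (sse, threshold, mean_left, mean_right)
    return best[1:]  # (threshold, left prediction, right prediction)

def fit_forest(xs, ys, n_trees, seed=0):
    """Train n_trees stumps on bootstrap samples; n_trees is a hyperparameter."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

# Toy data with an obvious jump between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0]

threshold, low_value, high_value = fit_stump(xs, ys)
forest = fit_forest(xs, ys, n_trees=5)
print("learned split point:", threshold)
print("trees in forest:", len(forest))
```

Changing `n_trees` changes how the model is built, but the split points are always recomputed from the training data.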

The choice of hyperparameters strongly influences model performance,
so it is important to optimize them as well. [AMPL](https://github.com/ATOMScience-org/AMPL)
offers a variety of hyperparameter optimization methods, including
random sampling, grid search, and Bayesian optimization.
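
As a rough illustration of how the first two strategies differ, the sketch below enumerates candidates with a grid and with random sampling over the ranges used later in this tutorial. It is not AMPL's implementation; `grid_candidates` and `random_candidates` are hypothetical helpers. Bayesian optimization differs from both in that it uses the results of earlier trials to decide which candidate to try next.

```python
import itertools
import random

# Search ranges taken from the config used later in this tutorial.
space = {
    "rf_estimators": (8, 512),
    "rf_max_depth": (6, 32),
    "rf_max_features": (8, 200),
}

def grid_candidates(space, points_per_axis):
    """Every combination of evenly spaced values along each axis."""
    axes = []
    for low, high in space.values():
        step = (high - low) / (points_per_axis - 1)
        axes.append([round(low + i * step) for i in range(points_per_axis)])
    return [dict(zip(space, combo)) for combo in itertools.product(*axes)]

def random_candidates(space, n_trials, seed=0):
    """Independent uniform integer draws for each hyperparameter."""
    rng = random.Random(seed)
    return [
        {name: rng.randint(low, high) for name, (low, high) in space.items()}
        for _ in range(n_trials)
    ]

grid = grid_candidates(space, points_per_axis=3)
sampled = random_candidates(space, n_trials=10)
print(len(grid), "grid candidates;", len(sampled), "random candidates")
```

The grid grows exponentially with the number of hyperparameters (3 points on 3 axes already gives 27 models), whereas the cost of random sampling is set directly by the number of trials.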

Here we demonstrate the following:
- Build a JSON config to perform a hyperparameter optimization for
a random forest using Bayesian optimization.
- Perform the optimization process.
- Select the best model and display optimal hyperparameter choices.

## JSON Settings
- `"hyperparam": "True"` This setting indicates that we are performing
a hyperparameter search instead of training a single model.
- `"search_type": "hyperopt"` This specifies the hyperparameter
search method. Here we use the Bayesian optimization method. Other
options include `grid`, `random`, and `geometric`. Each search method
takes different specifications; please refer to the full documentation.
- `"model_type": "RF|10"` This means [AMPL](https://github.com/ATOMScience-org/AMPL) will run 10 trials to
find the best set of hyperparameters for a random forest. In
production this number could be set to 100 or more.
- `"rfe": "uniformint|8,512"` The Bayesian optimizer will sample integers
uniformly between 8 and 512 to find the best number of random forest estimators.
Similarly, `rfd` stands for random forest depth and `rff` stands for
random forest features.
- `result_dir` now expects two comma-separated directories. The first directory
will contain the best trained model, while the second directory will
contain all models trained in the search.
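
The `|`-delimited strings above pack several values into one setting. The sketch below shows how such strings can be split apart. It is an illustration only, not AMPL's parser, and `parse_model_type` and `parse_range_spec` are hypothetical names.

```python
def parse_model_type(value):
    """Split 'RF|10' into the model name and the number of trials."""
    name, n_trials = value.split("|")
    return name, int(n_trials)

def parse_range_spec(value):
    """Split 'uniformint|8,512' into the distribution and its integer bounds."""
    distribution, bounds = value.split("|")
    low, high = (int(b) for b in bounds.split(","))
    return distribution, low, high

model_name, n_trials = parse_model_type("RF|10")
distribution, low, high = parse_range_spec("uniformint|8,512")

# result_dir holds two directories joined by a comma.
best_dir, all_dir = "./output_kcna3_rdkit_raw,./tmp_kcna3_rdkit_raw".split(",")

print(model_name, n_trials)
print(distribution, low, high)
print(best_dir, all_dir)
```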

In [1]:
import atomsci.ddm.utils.hyperparam_search_wrapper as hsw
import os

descriptor_type = 'rdkit_raw'
output_dir = 'output_kcna3_rdkit_raw'
tmp_dir = 'tmp_kcna3_rdkit_raw'
split_uuid = "3c4e7b81-35e8-49c1-97c8-6a12faa36df4"

config = {
    "hyperparam": "True",
    "prediction_type": "regression",

    "dataset_key": "dataset/curated_kcna3_ic50.csv",
    "id_col": "compound_id",
    "smiles_col": "base_rdkit_smiles",
    "response_cols": "avg_pIC50",

    "splitter":"scaffold",
    "split_uuid": split_uuid,
    "previously_split": "True",

    "featurizer": "computed_descriptors",
    "descriptor_type" : descriptor_type,
    "transformers": "True",

    "search_type": "hyperopt",
    "model_type": "RF|10",
    "rfe": "uniformint|8,512",
    "rfd": "uniformint|6,32",
    "rff": "uniformint|8,200",

    "result_dir": f"./{output_dir},./{tmp_dir}"
}


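# Create the directory for the best model if it does not already exist.
os.makedirs(f'./{output_dir}', exist_ok=True)

# Parse the config, build the search object, and run the hyperparameter search.
params = hsw.parse_params(config)
hs = hsw.build_search(params)
hs.run_search()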


  from .autonotebook import tqdm as notebook_tqdm
Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'torch_geometric'
Skipped loading modules with pytorch-geometric dependency, missing a dependency. cannot import name 'DMPNN' from 'deepchem.models.torch_models' (/usr/WS1/he6/AMPL_virtualenv_1.6/lib/python3.9/site-packages/deepchem/models/torch_models/__init__.py)
Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'pytorch_lightning'
Skipped loading some Jax models, missing a dependency. jax requires jaxlib to be installed. See https://github.com/google/jax#installation for installation instructions.
2024-01-10 14:33:16,473 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


model_performance|train_r2|train_rms|valid_r2|valid_rms|test_r2|test_rms|model_params|model

rf_estimators: 254, rf_max_depth: 23, rf_max_feature: 106
  0%|          | 0/10 [00:00<?, ?trial/s, best loss=?]

2024-01-10 14:33:16,556 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-10 14:33:16,581 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.
RF model with computed_descriptors and rdkit_raw      
  0%|          | 0/10 [00:00<?, ?trial/s, best loss=?]

2024-01-10 14:33:16,610 Reading descriptor spec table from /usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/data/descriptor_sets_sources_by_descr_type.csv
2024-01-10 14:33:16,620 Attempting to load featurized dataset
2024-01-10 14:33:16,655 Got dataset, attempting to extract data
2024-01-10 14:33:16,725 Creating deepchem dataset
2024-01-10 14:33:16,726 Using prefeaturized data; number of features = 200
2024-01-10 14:33:16,741 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-10 14:33:16,759 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/d629b010-081d-4ac2-816e-e591bd0a99e1/transformers.pkl
2024-01-10 14:33:16,760 Transforming response data
2024-01-10 14:33:16,761 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-10 14:33:16,765 Transforming response data
2024-01-10 14:33:16,766 Transforming feature data
2024-01-10 14:33:16,767 Transforming response data
2024-01-10

model_performance|0.976|0.282|0.724|0.752|0.820|0.721|254_23_106|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_d629b010-081d-4ac2-816e-e591bd0a99e1.tar.gz

rf_estimators: 305, rf_max_depth: 8, rf_max_feature: 121                         
 10%|█         | 1/10 [00:01<00:11,  1.33s/trial, best loss: 0.27644554399947097]

2024-01-10 14:33:17,884 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-10 14:33:17,906 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.                          
RF model with computed_descriptors and rdkit_raw                                 
 10%|█         | 1/10 [00:01<00:11,  1.33s/trial, best loss: 0.27644554399947097]

2024-01-10 14:33:17,932 Attempting to load featurized dataset
2024-01-10 14:33:17,966 Got dataset, attempting to extract data
2024-01-10 14:33:18,032 Creating deepchem dataset
2024-01-10 14:33:18,033 Using prefeaturized data; number of features = 200
2024-01-10 14:33:18,047 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-10 14:33:18,061 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/36bf605d-f10b-4916-ae03-ae676c3e120e/transformers.pkl
2024-01-10 14:33:18,062 Transforming response data
2024-01-10 14:33:18,063 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-10 14:33:18,066 Transforming response data
2024-01-10 14:33:18,067 Transforming feature data
2024-01-10 14:33:18,069 Transforming response data
2024-01-10 14:33:18,070 Transforming feature data
2024-01-10 14:33:18,072 Fitting random forest model
2024-01-10 14:33:18,848 Fold 0: training r2_score = 0.968

model_performance|0.968|0.329|0.720|0.756|0.814|0.731|305_8_121|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_36bf605d-f10b-4916-ae03-ae676c3e120e.tar.gz

rf_estimators: 430, rf_max_depth: 14, rf_max_feature: 26                         
 20%|██        | 2/10 [00:02<00:10,  1.32s/trial, best loss: 0.27644554399947097]

2024-01-10 14:33:19,189 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-10 14:33:19,211 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.                          
RF model with computed_descriptors and rdkit_raw                                 
 20%|██        | 2/10 [00:02<00:10,  1.32s/trial, best loss: 0.27644554399947097]

2024-01-10 14:33:19,232 Attempting to load featurized dataset
2024-01-10 14:33:19,267 Got dataset, attempting to extract data
2024-01-10 14:33:19,332 Creating deepchem dataset
2024-01-10 14:33:19,333 Using prefeaturized data; number of features = 200
2024-01-10 14:33:19,347 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-10 14:33:19,360 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/1e31a914-724a-4282-b3fc-58d2e68e6830/transformers.pkl
2024-01-10 14:33:19,361 Transforming response data
2024-01-10 14:33:19,362 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-10 14:33:19,365 Transforming response data
2024-01-10 14:33:19,365 Transforming feature data
2024-01-10 14:33:19,367 Transforming response data
2024-01-10 14:33:19,367 Transforming feature data
2024-01-10 14:33:19,369 Fitting random forest model
2024-01-10 14:33:20,442 Fold 0: training r2_score = 0.975

model_performance|0.975|0.291|0.723|0.753|0.827|0.707|430_14_26|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_1e31a914-724a-4282-b3fc-58d2e68e6830.tar.gz

rf_estimators: 194, rf_max_depth: 27, rf_max_feature: 45                         
 30%|███       | 3/10 [00:04<00:11,  1.58s/trial, best loss: 0.27644554399947097]

2024-01-10 14:33:21,080 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-10 14:33:21,102 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.                          
RF model with computed_descriptors and rdkit_raw                                 
 30%|███       | 3/10 [00:04<00:11,  1.58s/trial, best loss: 0.27644554399947097]

2024-01-10 14:33:21,125 Attempting to load featurized dataset
2024-01-10 14:33:21,158 Got dataset, attempting to extract data
2024-01-10 14:33:21,219 Creating deepchem dataset
2024-01-10 14:33:21,220 Using prefeaturized data; number of features = 200
2024-01-10 14:33:21,233 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-10 14:33:21,247 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/075a3895-90ea-4a68-824e-e47ebfa50e2e/transformers.pkl
2024-01-10 14:33:21,247 Transforming response data
2024-01-10 14:33:21,248 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-10 14:33:21,251 Transforming response data
2024-01-10 14:33:21,252 Transforming feature data
2024-01-10 14:33:21,254 Transforming response data
2024-01-10 14:33:21,254 Transforming feature data
2024-01-10 14:33:21,256 Fitting random forest model
2024-01-10 14:33:21,770 Fold 0: training r2_score = 0.975

model_performance|0.975|0.289|0.711|0.769|0.821|0.717|194_27_45|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_075a3895-90ea-4a68-824e-e47ebfa50e2e.tar.gz

rf_estimators: 386, rf_max_depth: 11, rf_max_feature: 186                        
 40%|████      | 4/10 [00:05<00:08,  1.36s/trial, best loss: 0.27644554399947097]

2024-01-10 14:33:22,104 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-10 14:33:22,129 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.                          
RF model with computed_descriptors and rdkit_raw                                 
 40%|████      | 4/10 [00:05<00:08,  1.36s/trial, best loss: 0.27644554399947097]

2024-01-10 14:33:22,151 Attempting to load featurized dataset
2024-01-10 14:33:22,186 Got dataset, attempting to extract data
2024-01-10 14:33:22,250 Creating deepchem dataset
2024-01-10 14:33:22,251 Using prefeaturized data; number of features = 200
2024-01-10 14:33:22,265 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-10 14:33:22,278 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/20126de6-00f4-4f94-9bc9-fbedf1b1469a/transformers.pkl
2024-01-10 14:33:22,279 Transforming response data
2024-01-10 14:33:22,280 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-10 14:33:22,282 Transforming response data
2024-01-10 14:33:22,283 Transforming feature data
2024-01-10 14:33:22,284 Transforming response data
2024-01-10 14:33:22,285 Transforming feature data
2024-01-10 14:33:22,286 Fitting random forest model
2024-01-10 14:33:23,271 Fold 0: training r2_score = 0.975

model_performance|0.975|0.290|0.721|0.756|0.820|0.719|386_11_186|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_20126de6-00f4-4f94-9bc9-fbedf1b1469a.tar.gz

rf_estimators: 467, rf_max_depth: 23, rf_max_feature: 86                         
 50%|█████     | 5/10 [00:07<00:07,  1.48s/trial, best loss: 0.27644554399947097]

2024-01-10 14:33:23,798 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-10 14:33:23,820 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.                          
RF model with computed_descriptors and rdkit_raw                                 
 50%|█████     | 5/10 [00:07<00:07,  1.48s/trial, best loss: 0.27644554399947097]

2024-01-10 14:33:23,841 Attempting to load featurized dataset
2024-01-10 14:33:23,874 Got dataset, attempting to extract data
2024-01-10 14:33:23,936 Creating deepchem dataset
2024-01-10 14:33:23,937 Using prefeaturized data; number of features = 200
2024-01-10 14:33:23,950 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-10 14:33:23,962 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/304c3d79-4eba-4f0f-a2e9-adb8e92bfaa3/transformers.pkl
2024-01-10 14:33:23,963 Transforming response data
2024-01-10 14:33:23,964 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-10 14:33:23,966 Transforming response data
2024-01-10 14:33:23,967 Transforming feature data
2024-01-10 14:33:23,968 Transforming response data
2024-01-10 14:33:23,969 Transforming feature data
2024-01-10 14:33:23,970 Fitting random forest model
2024-01-10 14:33:25,139 Fold 0: training r2_score = 0.976

model_performance|0.976|0.284|0.718|0.759|0.818|0.724|467_23_86|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_304c3d79-4eba-4f0f-a2e9-adb8e92bfaa3.tar.gz

rf_estimators: 148, rf_max_depth: 18, rf_max_feature: 157                        
 60%|██████    | 6/10 [00:09<00:06,  1.66s/trial, best loss: 0.27644554399947097]

2024-01-10 14:33:25,808 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-10 14:33:25,829 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.                          
RF model with computed_descriptors and rdkit_raw                                 
 60%|██████    | 6/10 [00:09<00:06,  1.66s/trial, best loss: 0.27644554399947097]

2024-01-10 14:33:25,850 Attempting to load featurized dataset
2024-01-10 14:33:25,881 Got dataset, attempting to extract data
2024-01-10 14:33:25,942 Creating deepchem dataset
2024-01-10 14:33:25,942 Using prefeaturized data; number of features = 200
2024-01-10 14:33:25,955 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-10 14:33:25,967 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/8c5b367d-4679-416c-a441-7c3c3a8ef399/transformers.pkl
2024-01-10 14:33:25,967 Transforming response data
2024-01-10 14:33:25,968 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-10 14:33:25,971 Transforming response data
2024-01-10 14:33:25,972 Transforming feature data
2024-01-10 14:33:25,973 Transforming response data
2024-01-10 14:33:25,974 Transforming feature data
2024-01-10 14:33:25,975 Fitting random forest model
2024-01-10 14:33:26,380 Fold 0: training r2_score = 0.976

model_performance|0.976|0.284|0.707|0.774|0.811|0.739|148_18_157|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_8c5b367d-4679-416c-a441-7c3c3a8ef399.tar.gz

rf_estimators: 250, rf_max_depth: 21, rf_max_feature: 16                         
 70%|███████   | 7/10 [00:10<00:04,  1.39s/trial, best loss: 0.27644554399947097]

2024-01-10 14:33:26,656 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-10 14:33:26,679 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.                          
RF model with computed_descriptors and rdkit_raw                                 
 70%|███████   | 7/10 [00:10<00:04,  1.39s/trial, best loss: 0.27644554399947097]

2024-01-10 14:33:26,701 Attempting to load featurized dataset
2024-01-10 14:33:26,737 Got dataset, attempting to extract data
2024-01-10 14:33:26,804 Creating deepchem dataset
2024-01-10 14:33:26,805 Using prefeaturized data; number of features = 200
2024-01-10 14:33:26,820 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-10 14:33:26,833 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/879457eb-19ba-4dc6-9581-ed2b1e0dcc4b/transformers.pkl
2024-01-10 14:33:26,834 Transforming response data
2024-01-10 14:33:26,834 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-10 14:33:26,837 Transforming response data
2024-01-10 14:33:26,838 Transforming feature data
2024-01-10 14:33:26,839 Transforming response data
2024-01-10 14:33:26,840 Transforming feature data
2024-01-10 14:33:26,841 Fitting random forest model
2024-01-10 14:33:27,488 Fold 0: training r2_score = 0.975

model_performance|0.975|0.286|0.717|0.761|0.830|0.699|250_21_16|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_879457eb-19ba-4dc6-9581-ed2b1e0dcc4b.tar.gz

rf_estimators: 85, rf_max_depth: 18, rf_max_feature: 67                          
 80%|████████  | 8/10 [00:11<00:02,  1.35s/trial, best loss: 0.27644554399947097]

2024-01-10 14:33:27,902 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-10 14:33:27,924 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.                          
RF model with computed_descriptors and rdkit_raw                                 
 80%|████████  | 8/10 [00:11<00:02,  1.35s/trial, best loss: 0.27644554399947097]

2024-01-10 14:33:27,949 Attempting to load featurized dataset
2024-01-10 14:33:27,984 Got dataset, attempting to extract data
2024-01-10 14:33:28,046 Creating deepchem dataset
2024-01-10 14:33:28,047 Using prefeaturized data; number of features = 200
2024-01-10 14:33:28,060 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-10 14:33:28,073 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/06c3817f-24c2-4869-983b-bbba4973a8ee/transformers.pkl
2024-01-10 14:33:28,074 Transforming response data
2024-01-10 14:33:28,074 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-10 14:33:28,077 Transforming response data
2024-01-10 14:33:28,078 Transforming feature data
2024-01-10 14:33:28,079 Transforming response data
2024-01-10 14:33:28,079 Transforming feature data
2024-01-10 14:33:28,081 Fitting random forest model
2024-01-10 14:33:28,336 Fold 0: training r2_score = 0.976

model_performance|0.976|0.286|0.711|0.769|0.823|0.714|85_18_67|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_06c3817f-24c2-4869-983b-bbba4973a8ee.tar.gz

rf_estimators: 509, rf_max_depth: 15, rf_max_feature: 61                         
 90%|█████████ | 9/10 [00:11<00:01,  1.12s/trial, best loss: 0.27644554399947097]

2024-01-10 14:33:28,530 ['dataset_hash'] are not part of the accepted list of parameters and will be ignored
2024-01-10 14:33:28,552 Created a dataset hash '389b161b7a4eb2304323a7dfddacfacc' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna3_ic50.csv'


num_model_tasks is deprecated and its value is ignored.                          
RF model with computed_descriptors and rdkit_raw                                 
 90%|█████████ | 9/10 [00:12<00:01,  1.12s/trial, best loss: 0.27644554399947097]

2024-01-10 14:33:28,573 Attempting to load featurized dataset
2024-01-10 14:33:28,607 Got dataset, attempting to extract data
2024-01-10 14:33:28,670 Creating deepchem dataset
2024-01-10 14:33:28,671 Using prefeaturized data; number of features = 200
2024-01-10 14:33:28,684 Previous dataset split restored
  X_m2 += dx * (X - X_means)

2024-01-10 14:33:28,696 Wrote transformers to ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50/RF_computed_descriptors_scaffold_regression/c9e8fcd1-7013-406d-95e3-889d2417e65c/transformers.pkl
2024-01-10 14:33:28,697 Transforming response data
2024-01-10 14:33:28,698 Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

2024-01-10 14:33:28,701 Transforming response data
2024-01-10 14:33:28,701 Transforming feature data
2024-01-10 14:33:28,702 Transforming response data
2024-01-10 14:33:28,703 Transforming feature data
2024-01-10 14:33:28,704 Fitting random forest model
2024-01-10 14:33:29,969 Fold 0: training r2_score = 0.976

model_performance|0.976|0.285|0.719|0.759|0.827|0.706|509_15_61|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_c9e8fcd1-7013-406d-95e3-889d2417e65c.tar.gz

100%|██████████| 10/10 [00:14<00:00,  1.42s/trial, best loss: 0.27644554399947097]
Generating the performance -- iteration table and Copy the best model tarball.
Best model: ./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_d629b010-081d-4ac2-816e-e591bd0a99e1.tar.gz, valid R2: 0.723554456000529


The best model is saved in `output_kcna3_rdkit_raw` along with a CSV file
containing regression performance for all trained models.

All models trained during the search are saved in `tmp_kcna3_rdkit_raw`. These models can be
explored using `get_filesystem_perf_results`.
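
The `model_performance` lines printed during the search follow the pipe-delimited header shown at the start of the output, so the selection can be reproduced by hand. The sketch below is a hypothetical helper, not part of AMPL; it parses three of the logged lines and picks the one with the highest validation R2. It also checks that the reported `best loss` is consistent with 1 minus the best validation R2, which is an inference from the printed numbers rather than documented behavior.

```python
HEADER = ("model_performance|train_r2|train_rms|valid_r2|valid_rms|"
          "test_r2|test_rms|model_params|model")

# Three lines copied from the search output.
LOG_LINES = [
    "model_performance|0.976|0.282|0.724|0.752|0.820|0.721|254_23_106|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_d629b010-081d-4ac2-816e-e591bd0a99e1.tar.gz",
    "model_performance|0.968|0.329|0.720|0.756|0.814|0.731|305_8_121|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_36bf605d-f10b-4916-ae03-ae676c3e120e.tar.gz",
    "model_performance|0.975|0.291|0.723|0.753|0.827|0.707|430_14_26|./tmp_kcna3_rdkit_raw/curated_kcna3_ic50_model_1e31a914-724a-4282-b3fc-58d2e68e6830.tar.gz",
]

def parse_performance(line, header=HEADER):
    """Map one pipe-delimited log line onto the header's column names."""
    fields = header.split("|")[1:]   # drop the 'model_performance' tag
    values = line.split("|")[1:]
    row = dict(zip(fields, values))
    for key in fields[:6]:           # the six metric columns are numeric
        row[key] = float(row[key])
    return row

rows = [parse_performance(line) for line in LOG_LINES]
best = max(rows, key=lambda row: row["valid_r2"])
print(best["model_params"], best["valid_r2"])

# Values copied from the final lines of the search output.
best_loss = 0.27644554399947097
best_valid_r2 = 0.723554456000529
loss_matches = abs((1 - best_valid_r2) - best_loss) < 1e-9
print("best loss equals 1 - valid R2:", loss_matches)
```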

In [2]:
import atomsci.ddm.pipeline.compare_models as cm

result_df = cm.get_filesystem_perf_results(
    result_dir='tmp_kcna3_rdkit_raw',
    pred_type='regression'
)

# sort by validation r2 score to find the best model
result_df = result_df.sort_values(by='best_valid_r2_score', ascending=False)



Found data for 10 models under tmp_kcna3_rdkit_raw


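The column `model_parameters_dict` contains the hyperparameters used for each model. Because `result_df` is sorted by validation R2 score, the first row holds those of the best model.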

In [3]:
result_df.iloc[0].model_parameters_dict

'{"rf_estimators": 254, "rf_max_depth": 23, "rf_max_features": 106}'
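
The quoted output suggests that the value is stored as a JSON string rather than a dictionary. Assuming that is the case, it can be converted with the standard `json` module; the string below is copied from the output above.

```python
import json

# String copied from the notebook output; in the notebook this would be
# result_df.iloc[0].model_parameters_dict (assumed to be a JSON string).
raw = '{"rf_estimators": 254, "rf_max_depth": 23, "rf_max_features": 106}'

best_params = json.loads(raw)
for name, value in best_params.items():
    print(f"{name} = {value}")
```

Once parsed, the values can be reused directly, for example to train a single model with the optimal settings.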