# Train a Simple Regression Model

Training an ML model means giving an ML algorithm (that is, the learning algorithm) training data to learn from. The term ML model refers to the model artifact that the training process creates. The goal of training a regression model is to find the parameter values (for example, weights) that minimize the loss function, i.e., that make the difference between the predicted values and the true labels as small as possible.
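To make the idea of minimizing a loss concrete, the following sketch computes the mean squared error of a one-weight linear model for two candidate weights. The data, the function names and the candidate weights are made up for illustration and are not part of AMPL. A Random Forest, which we train later, is fitted by growing decision trees rather than by tuning weights, but the notion of a loss between predictions and labels is the same.

```python
def predict(weight, xs):
    """Predictions of a one-weight linear model y = weight * x."""
    return [weight * x for x in xs]


def mean_squared_error(y_true, y_pred):
    """Average squared difference between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data generated from y = 2x (illustrative only)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

loss_poor = mean_squared_error(ys, predict(1.0, xs))  # weight far from 2
loss_good = mean_squared_error(ys, predict(2.0, xs))  # weight that generated the data
print(loss_poor, loss_good)
```

The weight that generated the data gives the smaller loss, which is what a training procedure searches for.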

This tutorial details how to use AMPL tools to train a regression model that predicts the pIC50 values of the kcna5 target assay. We will train a Random Forest model using rdkit features of the curated kcna5 data; split the dataset (or use an already generated split file); explain the use of descriptors; evaluate the performance of the model; and save the model as a .tar.gz file in a preferred location for easy retrieval.

Please note that training a Random Forest model and splitting the dataset are inherently non-deterministic. You may obtain a different Random Forest model by running this tutorial each time.
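One source of this randomness is bootstrap sampling: each tree in a Random Forest is typically fitted on rows drawn with replacement. The sketch below uses only the Python standard library to show that unseeded draws differ from run to run, while draws with a fixed seed repeat. The helper name `bootstrap_indices` is hypothetical, and whether and how AMPL exposes a seed is not covered here.

```python
import random


def bootstrap_indices(n_rows, seed=None):
    """Draw n_rows row indices with replacement, as bagging does for each tree."""
    rng = random.Random(seed)  # seed=None uses an unpredictable seed
    return [rng.randrange(n_rows) for _ in range(n_rows)]


unseeded = bootstrap_indices(10)           # varies between runs
seeded_a = bootstrap_indices(10, seed=42)  # repeatable
seeded_b = bootstrap_indices(10, seed=42)  # same draw as seeded_a
print(seeded_a == seeded_b)
```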

# Model Training (using already split data)

We will use the curated dataset that we created in tutorial 2 and the split file that we created in tutorial 3 to build a json file for training. We set "previously_split": "True" and add the split_uuid. Here, we use "split_uuid": "bcd96299-6d61-4467-9e6b-814dcf8cde16", the uuid of the scaffold split created in tutorial 3.
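As a minimal sketch of keeping these settings in a json file, the snippet below writes a subset of the parameters used in this tutorial to disk and reads them back with the standard `json` module. The file name `rf_kcna5_config.json` is hypothetical, and the file is written to a temporary directory so that the sketch has no side effects. The tutorial cell itself passes the parameters as a Python dictionary.

```python
import json
import os
import tempfile

# Subset of the parameters used in this tutorial
params = {
    "dataset_key": "dataset/curated_kcna5_ic50.csv",
    "previously_split": "True",
    "split_uuid": "bcd96299-6d61-4467-9e6b-814dcf8cde16",
    "featurizer": "computed_descriptors",
    "descriptor_type": "rdkit_raw",
    "model_type": "RF",
    "prediction_type": "regression",
}

# Hypothetical file name, placed in a temporary directory
config_path = os.path.join(tempfile.mkdtemp(), "rf_kcna5_config.json")
with open(config_path, "w") as fh:
    json.dump(params, fh, indent=4)

# Read the file back to confirm that the round trip preserves the settings
with open(config_path) as fh:
    reloaded = json.load(fh)
print(reloaded["split_uuid"])
```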

AMPL provides an extensible featurization module that can generate a variety of molecular feature types, given SMILES strings as input. For demonstration purposes, we choose to use rdkit features in this tutorial.

If a featurized dataset has not previously been saved for curated_kcna5_ic50, AMPL creates one and saves it as a csv file in a folder called scaled_descriptors: dataset/scaled_descriptors/curated_kcna5_ic50_with_rdkit_raw_descriptors.csv
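To check whether a featurized file is already present, the expected path can be built from the dataset file name. The naming pattern in this sketch is inferred from the single example path above and is an assumption, not a documented AMPL rule; the helper `expected_featurized_path` is hypothetical.

```python
from pathlib import Path


def expected_featurized_path(dataset_file, descriptor_type):
    """Build the featurized csv path, assuming the pattern seen in this tutorial."""
    dataset = Path(dataset_file)
    name = f"{dataset.stem}_with_{descriptor_type}_descriptors.csv"
    return dataset.parent / "scaled_descriptors" / name


path = expected_featurized_path("dataset/curated_kcna5_ic50.csv", "rdkit_raw")
# exists() is False until AMPL has featurized the dataset at least once
print(path.as_posix(), path.exists())
```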

In [2]:
# importing relevant libraries
import pandas as pd
from atomsci.ddm.pipeline import model_pipeline as mp
from atomsci.ddm.pipeline import parameter_parser as parse

# Set up
dataset_file = 'dataset/curated_kcna5_ic50.csv'
odir='dataset'

response_col = "avg_pIC50"
compound_id = "compound_id"
smiles_col = "base_rdkit_smiles"

params = {
        "verbose": "True",
        "system": "LC",
        "datastore": "False",
        "save_results": "False",
        "prediction_type": "regression",
        "dataset_key": dataset_file,
        "id_col": compound_id,
        "smiles_col": smiles_col,
        "response_cols": response_col,
        # reuse the scaffold split created in tutorial 3
        "previously_split": "True",
        "split_uuid": "bcd96299-6d61-4467-9e6b-814dcf8cde16",
        "split_only": "False",
        # rdkit descriptors as features, Random Forest as the model
        "featurizer": "computed_descriptors",
        "descriptor_type": "rdkit_raw",
        "model_type": "RF",
        "transformers": "True",
        "max_epochs": "70",
        "rerun": "False",
        "result_dir": odir
    }

ampl_param = parse.wrapper(params)
pl = mp.ModelPipeline(ampl_param)
pl.train_model()

  from .autonotebook import tqdm as notebook_tqdm
Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'torch_geometric'
Skipped loading modules with pytorch-geometric dependency, missing a dependency. cannot import name 'DMPNN' from 'deepchem.models.torch_models' (/gpfs/gsfs12/users/lup2/AMPL/ampl_tutorials/lib/python3.8/site-packages/deepchem/models/torch_models/__init__.py)
Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'pytorch_lightning'
Skipped loading some Jax models, missing a dependency. jax requires jaxlib to be installed. See https://github.com/google/jax#installation for installation instructions.
DEBUG:ATOM:Model tracker client not supported in your environment; will save models in filesystem only.
INFO:ATOM:Created a dataset hash 'd73e30e5b0ddf05e34665d76e5c62d27' from dataset_key '/gpfs/gsfs12/users/lup2/AMPL/AMPL_setup_tutorials/atomsci/ddm/examples/tutorials2023/dataset/curate

# Model Training (Split data and train)

Let us look at how to split the dataset and then train. Here, we set "previously_split": "False" and omit the split_uuid parameter. AMPL splits the data according to the type of split specified in the splitter parameter (here, scaffold) and writes the split file to dataset/curated_kcna5_ic50_train_valid_test_scaffold_{split_uuid}.csv. After training, AMPL saves the model and all of its parameters as a tarball in the result_dir.
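The arithmetic behind the split fractions and the split file name can be sketched as follows. With a validation fraction of 0.15 and a test fraction of 0.15, the remaining 0.70 of the compounds go to training. Both helper functions are hypothetical, and the uuid from the previous section is reused only as an example value, since a new split receives its own uuid.

```python
def train_fraction(valid_frac, test_frac):
    """Fraction of the data left for training after validation and test."""
    if valid_frac < 0 or test_frac < 0 or valid_frac + test_frac >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    return 1.0 - valid_frac - test_frac


def split_file_name(dataset_stem, splitter, split_uuid):
    """File name following the pattern described in this tutorial."""
    return f"{dataset_stem}_train_valid_test_{splitter}_{split_uuid}.csv"


# The params dictionary stores the fractions as strings, so convert them first
frac = train_fraction(float("0.15"), float("0.15"))
name = split_file_name(
    "curated_kcna5_ic50", "scaffold", "bcd96299-6d61-4467-9e6b-814dcf8cde16"
)
print(round(frac, 2), name)
```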

In [3]:
response_col = "avg_pIC50"
compound_id = "compound_id"
smiles_col = "base_rdkit_smiles"

params = {
        "verbose": "True",
        "system": "LC",
        "datastore": "False",
        "save_results": "False",
        "prediction_type": "regression",
        "dataset_key": dataset_file,
        "id_col": compound_id,
        "smiles_col": smiles_col,
        "response_cols": response_col,
        # let AMPL create a new scaffold split before training
        "previously_split": "False",
        "split_only": "False",
        "splitter": "scaffold",
        "split_valid_frac": "0.15",
        "split_test_frac": "0.15",
        # rdkit descriptors as features, Random Forest as the model
        "featurizer": "computed_descriptors",
        "descriptor_type": "rdkit_raw",
        "model_type": "RF",
        "transformers": "True",
        "max_epochs": "70",
        "rerun": "False",
        "result_dir": odir
    }

ampl_param = parse.wrapper(params)
pl = mp.ModelPipeline(ampl_param)
pl.train_model()

INFO:ATOM:Created a dataset hash 'd73e30e5b0ddf05e34665d76e5c62d27' from dataset_key '/gpfs/gsfs12/users/lup2/AMPL/AMPL_setup_tutorials/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna5_ic50.csv'
DEBUG:ATOM:Attempting to load featurized dataset
DEBUG:ATOM:Got dataset, attempting to extract data
DEBUG:ATOM:Creating deepchem dataset
INFO:ATOM:Using prefeaturized data; number of features = 200
INFO:ATOM:Wrote transformers to dataset/curated_kcna5_ic50/RF_computed_descriptors_scaffold_regression/2d42a29e-4d10-4c88-abfa-3c811eae05ef/transformers.pkl
INFO:ATOM:Transforming response data
INFO:ATOM:Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)
INFO:ATOM:Transforming response data
INFO:ATOM:Transforming feature data
INFO:ATOM:Transforming response data
INFO:ATOM:Transforming feature data
INFO:ATOM:Fitting random forest model
INFO:ATOM:Fold 0: training r2_score = 0.941, validation r2_score = 0.195, test r2_score = 0.396
INFO:ATOM:Wrote model ta
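
The fold summary in the log above already hints at how well the model generalizes. As a small sketch, the snippet below pulls the three R² values out of that log line with a regular expression and computes the gap between training and validation. A large gap is commonly read as a sign of overfitting. The helper `parse_r2_scores` is hypothetical and relies only on the line format shown above.

```python
import re

# Log line copied from the training output above
log_line = (
    "INFO:ATOM:Fold 0: training r2_score = 0.941, "
    "validation r2_score = 0.195, test r2_score = 0.396"
)


def parse_r2_scores(line):
    """Return a dict mapping each subset name to its r2_score."""
    pattern = r"(training|validation|test) r2_score = (-?\d+(?:\.\d+)?)"
    return {subset: float(value) for subset, value in re.findall(pattern, line)}


scores = parse_r2_scores(log_line)
gap = scores["training"] - scores["validation"]
print(scores, round(gap, 3))
```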

# Performance of the model
Model performance in machine learning measures how accurate a model's predictions are on new, unseen data. We typically measure it on a test set by comparing the predictions for the test set to the actual outcomes.
Performance metrics are a part of every machine learning pipeline. They tell you whether you are making progress and put a number on it.
Regression models have continuous output, so we need a metric based on some sort of distance between the predictions and the ground truth.

Popular metrics for evaluating regression models are Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and R² (R-squared). We will compare the R² scores of our models; our top model is the one with the highest R² score on the validation set.
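
The four metrics named above can be computed by hand in a few lines. This sketch uses made-up labels and predictions and plain Python; it is meant to show the definitions and is not how AMPL computes its scores internally.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R² from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Illustrative values only
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 6.0, 6.5, 8.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAE, MSE and RMSE are errors, so lower is better, while R² is higher for better fits and equals 1 for perfect predictions.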

Please note that the model tracker client is not supported in this environment, so the model results are read from the filesystem.

In [4]:
# Model Performance
from atomsci.ddm.pipeline import compare_models as cm
pred_df = cm.get_filesystem_perf_results(odir, pred_type='regression')

DEBUG:ATOM:Model tracker client not supported in your environment; can look at models in filesystem only.


Found data for 2 models under dataset


The pred_df dataframe has details about the model_uuid, model_path, ampl_version, model_type, features, splitter and the results for popular metrics that help evaluate the performance. Let us view the contents of the pred_df dataframe.

In [5]:
pred_df.to_csv('./dataset/pred_df.csv')

In [6]:
# View the pred_df dataframe
pred_df

| | model_uuid | model_path | ampl_version | model_type | dataset_key | features | splitter | model_score_type | feature_transform_type | model_choice_score | ... | rf_max_depth | max_epochs | best_epoch | learning_rate | layer_sizes | dropouts | xgb_gamma | xgb_learning_rate | model_parameters_dict | feat_parameters_dict |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ebc39cab-fc9f-4238-827e-241850cee82b | dataset/curated_kcna5_ic50_model_ebc39cab-fc9f... | 1.6.0 | RF | /gpfs/gsfs12/users/lup2/AMPL/AMPL_setup_tutori... | rdkit_raw | scaffold | r2 | normalization | 0.369536 | ... | | | | | | | | | {"rf_estimators": 500, "rf_max_depth": null, "... | {} |
| 1 | 2d42a29e-4d10-4c88-abfa-3c811eae05ef | dataset/curated_kcna5_ic50_model_2d42a29e-4d10... | 1.6.0 | RF | /gpfs/gsfs12/users/lup2/AMPL/AMPL_setup_tutori... | rdkit_raw | scaffold | r2 | normalization | 0.194591 | ... | | | | | | | | | {"rf_estimators": 500, "rf_max_depth": null, "... | {} |


# Top Performing Model
To pick the top performing model, we sort the models by their R² score on the validation set in descending order and take the first row.

In [7]:
# Top performing model
top_model=pred_df.sort_values(by="best_valid_r2_score", ascending=False).iloc[0,:]
top_model

model_uuid                               ebc39cab-fc9f-4238-827e-241850cee82b
model_path                  dataset/curated_kcna5_ic50_model_ebc39cab-fc9f...
ampl_version                                                            1.6.0
model_type                                                                 RF
dataset_key                 /gpfs/gsfs12/users/lup2/AMPL/AMPL_setup_tutori...
features                                                            rdkit_raw
splitter                                                             scaffold
model_score_type                                                           r2
feature_transform_type                                          normalization
model_choice_score                                                   0.369536
best_train_r2_score                                                  0.941555
best_train_rms_score                                                 0.197274
best_train_mae_score                                            
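
The selection logic can be checked by hand on the two models found earlier. The sketch below uses plain dictionaries instead of pred_df and takes the model_choice_score values from the table as stand-ins for the validation R². That is an assumption, because the best_valid_r2_score column is elided in the printed output.

```python
# Two rows copied from the pred_df output; valid_r2 is an assumed stand-in
rows = [
    {"model_uuid": "ebc39cab-fc9f-4238-827e-241850cee82b", "valid_r2": 0.369536},
    {"model_uuid": "2d42a29e-4d10-4c88-abfa-3c811eae05ef", "valid_r2": 0.194591},
]

# Sort in descending order of validation R² and take the first entry
ranked = sorted(rows, key=lambda row: row["valid_r2"], reverse=True)
best = ranked[0]
print(best["model_uuid"])
```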

# Model tarball 
The location of the tarball in which the top performing model is saved is stored in top_model.model_path.

In [8]:
# Top performing model path
top_model.model_path

'dataset/curated_kcna5_ic50_model_ebc39cab-fc9f-4238-827e-241850cee82b.tar.gz'
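
To see what a tarball contains without unpacking it, the standard `tarfile` module is enough. The sketch below builds a tiny stand-in archive so that it runs without a trained model; the file names `demo_model.tar.gz` and `placeholder.json` are hypothetical and say nothing about what AMPL actually stores in its model tarballs.

```python
import io
import os
import tarfile
import tempfile


def list_tarball_members(tar_path):
    """Return the names of all members of a gzip-compressed tarball."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return tar.getnames()


# Build a stand-in tarball with a single placeholder file
demo_path = os.path.join(tempfile.mkdtemp(), "demo_model.tar.gz")
payload = b'{"note": "placeholder"}'
with tarfile.open(demo_path, "w:gz") as tar:
    info = tarfile.TarInfo(name="placeholder.json")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

members = list_tarball_members(demo_path)
print(members)
```

Passing top_model.model_path to `list_tarball_members` instead of `demo_path` would list the files saved with the trained model.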