# Train a Simple Regression Model

Training an ML model means giving a learning algorithm training data to learn from. The term ML model refers to the model artifact that the training process creates. The goal of training a regression model is to find the model parameters (for example, weights) that minimize a loss function, that is, that make the difference between the predicted values and the true labels as small as possible.
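As a minimal sketch of what "minimizing a loss function" means, the snippet below fits a straight line to made-up toy data and compares the mean squared error of an arbitrary guess with that of the least-squares fit. The data and function names are illustrative assumptions, and the linear model is used only because its weights are easy to inspect; the random forest trained later in this tutorial is fitted differently.

```python
def mse(weights, xs, ys):
    """Mean squared error of the line y = w * x + b on the data."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_line(xs, ys):
    """Closed-form least-squares estimate of the slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b

# Made-up toy data, not taken from the KCNA5 dataset
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

guess = (1.0, 0.0)          # arbitrary starting weights
best = fit_line(xs, ys)     # weights that minimize the MSE for this model
print("loss at initial guess:", mse(guess, xs, ys))
print("loss at fitted weights:", mse(best, xs, ys))
```

The fitted weights give a lower loss than the guess, which is the sense in which training "finds" good parameter values.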

This tutorial details how to use AMPL tools to train a regression model that predicts pIC50 values for the KCNA5 target assay. First, we introduce RDKit features. Next, we train a random forest model on RDKit features of the curated KCNA5 data; split the dataset (or use an already generated split file); explain the use of descriptors; evaluate the performance of the model; and save the model as a .tar.gz file in a preferred location for easy retrieval.
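For readers unfamiliar with the response variable: pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. The sketch below assumes IC50 values given in nanomolar; the function name is hypothetical, and the curated dataset used here already contains the converted avg_pIC50 column, so this step is shown only for orientation.

```python
import math

def pic50_from_ic50_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nm * 1e-9)

# A more potent compound (smaller IC50) has a larger pIC50
for ic50_nm in (1.0, 100.0, 10000.0):
    print(ic50_nm, "nM ->", round(pic50_from_ic50_nm(ic50_nm), 2))
```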

In [1]:
import pandas as pd

# Set up
dataset_file = 'dataset/curated_kcna5_ic50.csv'
odir='dataset'

# RDKit features

RDKit is an open source toolkit for cheminformatics. It is a collection of cheminformatics and machine-learning software written in C++ and Python. Let us see how to calculate descriptors using RDKit.

In [2]:
# Read the dataset
df = pd.read_csv(dataset_file)

In [3]:
# Calculate descriptors using RDKit

from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors

def RDkit_descriptors(smiles):
    """Return one tuple of descriptor values per SMILES string, plus the descriptor names."""
    mols = [Chem.MolFromSmiles(i) for i in smiles]
    calc = MoleculeDescriptors.MolecularDescriptorCalculator([x[0] for x in Descriptors._descList])
    desc_names = calc.GetDescriptorNames()

    Mol_descriptors = []
    for mol in mols:
        # Add hydrogens to the molecule
        mol = Chem.AddHs(mol)
        # Calculate every descriptor in Descriptors._descList
        # (about 200; the exact count depends on the RDKit version)
        descriptors = calc.CalcDescriptors(mol)
        Mol_descriptors.append(descriptors)
    return Mol_descriptors, desc_names

# Function call
Mol_descriptors, desc_names = RDkit_descriptors(df['base_rdkit_smiles'])

RDKit and AMPL provide a variety of descriptor options. For demonstration purposes, we use RDKit features in this tutorial.

In [4]:
# View the descriptors
df_with_descriptors = pd.DataFrame(Mol_descriptors,columns=desc_names)
df_with_descriptors

Unnamed: 0,MaxEStateIndex,MinEStateIndex,MaxAbsEStateIndex,MinAbsEStateIndex,qed,MolWt,HeavyAtomMolWt,ExactMolWt,NumValenceElectrons,NumRadicalElectrons,...,fr_sulfide,fr_sulfonamd,fr_sulfone,fr_term_acetylene,fr_tetrazole,fr_thiazole,fr_thiocyan,fr_thiophene,fr_unbrch_alkane,fr_urea
0,13.146942,-4.451515,13.146942,0.198952,0.819319,360.483,336.291,360.161997,134,0,...,0,0,0,0,0,1,0,0,0,0
1,13.887482,-4.131533,13.887482,0.826648,0.535035,429.520,402.304,429.205242,164,0,...,0,0,0,0,0,0,0,0,0,0
2,14.824199,-5.573055,14.824199,0.616859,0.610704,456.499,435.331,456.126754,166,0,...,0,1,0,0,0,0,0,0,0,0
3,14.880656,-6.737257,14.880656,1.161053,0.309411,617.690,583.418,617.217127,230,0,...,0,1,0,0,0,0,0,0,0,0
4,14.634979,-6.703996,14.634979,0.580561,0.538487,494.613,464.373,494.187543,184,0,...,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
797,13.156727,-4.460282,13.156727,0.195972,0.757681,376.482,352.290,376.156912,140,0,...,0,0,0,0,0,1,0,0,0,0
798,14.806492,-5.664155,14.806492,0.762158,0.707533,465.497,443.321,465.133397,170,0,...,0,1,0,0,0,0,0,0,0,0
799,13.620699,-5.665571,13.620699,0.046920,0.299894,480.977,459.809,480.102289,168,0,...,0,1,0,0,0,0,0,0,0,0
800,14.853877,-4.507663,14.853877,0.062519,0.774867,405.329,383.153,404.117067,142,0,...,0,0,0,0,0,0,0,0,0,0


# Model Training (using already split data)

We will use the curated dataset created in tutorial 2 and the split file created in tutorial 3, and build a JSON-style parameter dictionary for training. We set "previously_split": "True" and add the split_uuid. Here, we use "split_uuid": "bcd96299-6d61-4467-9e6b-814dcf8cde16", the UUID of the scaffold split created in tutorial 3.

If a featurized dataset has not already been saved for curated_kcna5_ic50, AMPL creates one and saves it as a CSV file in a folder called scaled_descriptors: dataset/scaled_descriptors/curated_kcna5_ic50_with_rdkit_raw_descriptors.csv
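Because the training parameters are plain key-value pairs, they can be kept in a JSON file and loaded back into a dictionary before being passed to the parser. The following is a minimal sketch using only the standard library; the file name kcna5_rf_config.json and the reduced set of keys are assumptions made for illustration, and the check at the end is a convenience of this sketch rather than part of AMPL.

```python
import json
import os
import tempfile

# Subset of the settings described above; the file name is hypothetical
params = {
    "dataset_key": "dataset/curated_kcna5_ic50.csv",
    "previously_split": "True",
    "split_uuid": "bcd96299-6d61-4467-9e6b-814dcf8cde16",
    "featurizer": "computed_descriptors",
    "descriptor_type": "rdkit_raw",
    "model_type": "RF",
    "prediction_type": "regression",
}

# Write the parameters to a JSON file in a temporary directory
config_path = os.path.join(tempfile.mkdtemp(), "kcna5_rf_config.json")
with open(config_path, "w") as fh:
    json.dump(params, fh, indent=4)

# Read them back into a dictionary
with open(config_path) as fh:
    loaded = json.load(fh)

# A previously split run needs a split_uuid to identify the split file
if loaded["previously_split"] == "True" and not loaded.get("split_uuid"):
    raise ValueError("split_uuid is required when previously_split is True")
print(loaded["split_uuid"])
```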

In [5]:
# importing relevant libraries
from atomsci.ddm.pipeline import model_pipeline as mp
from atomsci.ddm.pipeline import parameter_parser as parse

# Set up
dataset_file = 'dataset/curated_kcna5_ic50.csv'
odir='dataset'

response_col = "avg_pIC50"
compound_id = "compound_id"
smiles_col = "base_rdkit_smiles"

params = {
        "verbose": "True",
        "system": "LC",
        "datastore": "False",
        "save_results": "False",
        "prediction_type": "regression",
        "dataset_key": dataset_file,
        "id_col": compound_id,
        "smiles_col": smiles_col,
        "response_cols": response_col,
        "previously_split": "True",
        "split_uuid": "bcd96299-6d61-4467-9e6b-814dcf8cde16",
        "split_only": "False",
        "featurizer": "computed_descriptors",
        "descriptor_type": "rdkit_raw",
        "model_type": "RF",
        "transformers": "True",
        "max_epochs": "70",
        "rerun": "False",
        "result_dir": odir
    }

ampl_param = parse.wrapper(params)
pl = mp.ModelPipeline(ampl_param)
pl.train_model()

  from .autonotebook import tqdm as notebook_tqdm
Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'torch_geometric'
Skipped loading modules with pytorch-geometric dependency, missing a dependency. cannot import name 'DMPNN' from 'deepchem.models.torch_models' (/usr/WS1/hiran/ampl160/lib/python3.9/site-packages/deepchem/models/torch_models/__init__.py)
Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'pytorch_lightning'
Skipped loading some Jax models, missing a dependency. jax requires jaxlib to be installed. See https://github.com/google/jax#installation for installation instructions.
INFO:ATOM:Created a dataset hash 'b7b8b0a25c13147093936f806045e0ac' from dataset_key '/usr/WS1/hiran/AMPL/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna5_ic50.csv'
INFO:ATOM:Reading descriptor spec table from /usr/WS1/hiran/AMPL/atomsci/ddm/data/descriptor_sets_sources_by_descr_type.csv
DEBUG:ATOM:At

# Explore the model tar and metadata files

In [6]:
!tar -tf dataset/*.tar.gz

./best_model/
./best_model/model.joblib
./model_metadata.json
./model_metrics.json
./transformers.pkl
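The same inspection can be done from Python with the standard tarfile module, which avoids shell globbing and lets you read the metadata without unpacking the archive to disk. The sketch below builds a tiny stand-in archive so that it runs on its own; the archive name and its contents are assumptions for illustration, and with a real model you would point tar_path at the tarball in result_dir instead.

```python
import io
import json
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
tar_path = os.path.join(workdir, "example_model.tar.gz")

# Build a tiny stand-in archive so the sketch runs without a trained model
metadata = {"model_parameters": {"model_type": "RF", "prediction_type": "regression"}}
payload = json.dumps(metadata).encode("utf-8")
with tarfile.open(tar_path, "w:gz") as tar:
    info = tarfile.TarInfo(name="model_metadata.json")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# List the members, then read the metadata directly from the archive
with tarfile.open(tar_path, "r:gz") as tar:
    member_names = tar.getnames()
    meta_member = next(m for m in tar.getmembers()
                       if m.name.endswith("model_metadata.json"))
    with tar.extractfile(meta_member) as fh:
        model_params = json.load(fh)["model_parameters"]

print(member_names)
print(model_params["model_type"])
```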


In [7]:
!tar xzf dataset/*.tar.gz -C /tmp

In [8]:
!cat /tmp/model_metadata.json | head

{
    "descriptor_specific": {
        "descriptor_bucket": "public",
        "descriptor_key": null,
        "descriptor_type": "rdkit_raw"
    },
    "model_parameters": {
        "ampl_version": "1.5.1",
        "class_number": 2,
        "featurizer": "computed_descriptors",


In [9]:
import joblib
# load the model from disk
loaded_model = joblib.load("/tmp/best_model/model.joblib")
loaded_model

RandomForestRegressor(max_features=32, n_estimators=500, n_jobs=-1)

In [10]:
import json
# Read the model parameters from the metadata file
with open('/tmp/model_metadata.json') as f:
    data = json.load(f)
data['model_parameters']

{'ampl_version': '1.5.1',
 'class_number': 2,
 'featurizer': 'computed_descriptors',
 'hyperparam_uuid': None,
 'model_bucket': 'public',
 'model_choice_score_type': 'r2',
 'model_type': 'RF',
 'num_model_tasks': 1,
 'prediction_type': 'regression',
 'save_results': False,
 'system': 'LC',
 'time_generated': 1700520862.9832964,
 'transformer_bucket': '',
 'transformer_key': 'dataset/curated_kcna5_ic50/RF_computed_descriptors_scaffold_regression/d14f3390-a092-4c09-998b-b0aa6455fecc/transformers.pkl',
 'transformer_oid': '',
 'transformers': True,
 'uncertainty': True}

# Model Training (Split data and train)

Next, let us split the dataset and train the model in a single run. Here, we set "previously_split": "False" and omit the split_uuid parameter. AMPL splits the data using the method given in the splitter parameter (here, scaffold) and writes the split file to dataset/curated_kcna5_ic50_train_valid_test_scaffold_{split_uuid}.csv. After training, AMPL saves the model and all of its parameters as a tarball in result_dir.
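To give a feel for what a scaffold split does, the sketch below assigns whole scaffold groups to the training, validation, and test subsets, so that no scaffold appears in more than one subset. It is a simplified illustration in the spirit of a scaffold splitter, not AMPL's implementation: the scaffold labels are hypothetical strings (in practice they would be derived from the molecules), and the greedy largest-group-first assignment is an assumption of this sketch.

```python
from collections import defaultdict

def scaffold_style_split(scaffold_by_id, valid_frac=0.15, test_frac=0.15):
    """Assign whole scaffold groups to train, valid, or test subsets."""
    groups = defaultdict(list)
    for cid, scaffold in scaffold_by_id.items():
        groups[scaffold].append(cid)
    # Largest groups first, so big chemical series land in the training set
    ordered = sorted(groups.values(), key=len, reverse=True)

    n = len(scaffold_by_id)
    train_cutoff = (1.0 - valid_frac - test_frac) * n
    valid_cutoff = (1.0 - test_frac) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Hypothetical scaffold labels for 21 compounds (7 scaffolds, 3 compounds each)
scaffold_by_id = {f"cmpd_{i}": f"scaffold_{i % 7}" for i in range(21)}
train, valid, test = scaffold_style_split(scaffold_by_id)
print(len(train), len(valid), len(test))
```

Because groups are never broken up, the realized subset sizes only approximate the requested fractions. The model is then evaluated on chemical series it has not seen during training, which is generally a harder test than a random split.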

In [11]:
response_col = "avg_pIC50"
compound_id = "compound_id"
smiles_col = "base_rdkit_smiles"

params = {
        "verbose": "True",
        "system": "LC",
        "datastore": "False",
        "save_results": "False",
        "prediction_type": "regression",
        "dataset_key": dataset_file,
        "id_col": compound_id,
        "smiles_col": smiles_col,
        "response_cols": response_col,
        "previously_split": "False",
        "split_only": "False",
        "splitter": "scaffold",
        "split_valid_frac": "0.15",
        "split_test_frac": "0.15",
        "featurizer": "computed_descriptors",
        "descriptor_type": "rdkit_raw",
        "model_type": "RF",
        "transformers": "True",
        "max_epochs": "70",
        "rerun": "False",
        "result_dir": odir
    }

ampl_param = parse.wrapper(params)
pl = mp.ModelPipeline(ampl_param)
pl.train_model()

INFO:ATOM:Created a dataset hash 'b7b8b0a25c13147093936f806045e0ac' from dataset_key '/usr/WS1/hiran/AMPL/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna5_ic50.csv'
DEBUG:ATOM:Attempting to load featurized dataset
DEBUG:ATOM:Got dataset, attempting to extract data
DEBUG:ATOM:Creating deepchem dataset
INFO:ATOM:Using prefeaturized data; number of features = 200
INFO:ATOM:Wrote transformers to dataset/curated_kcna5_ic50/RF_computed_descriptors_scaffold_regression/75f7f1da-a332-40ae-a7ac-0a8f09a9b84b/transformers.pkl
INFO:ATOM:Transforming response data
INFO:ATOM:Transforming feature data
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)
INFO:ATOM:Transforming response data
INFO:ATOM:Transforming feature data
INFO:ATOM:Transforming response data
INFO:ATOM:Transforming feature data
INFO:ATOM:Fitting random forest model
INFO:ATOM:Fold 0: training r2_score = 0.941, validation r2_score = 0.334, test r2_score = 0.250
INFO:ATOM:Wrote model tarball to dataset/curated_kcna5

# Performance of the model
Model performance measures how accurate a model's predictions are on new, unseen data. We typically measure it on a test set, comparing the predictions on the test set with the actual outcomes.
Performance metrics are part of every machine learning pipeline. They tell you whether you are making progress, and put a number on it.
Regression models have continuous output, so we need a metric based on some measure of distance between the predictions and the ground truth.

Popular metrics for evaluating regression models are Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and R² (R-squared). We will compare the R² scores of our models; the top model is the one with the highest R² score on the validation set.
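As a minimal sketch of how these four metrics are computed, the snippet below evaluates them on a handful of made-up pIC50 values using only the standard library. The function name and the numbers are illustrative assumptions; AMPL reports its own scores, as shown in the training log above.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R² for paired true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # R² compares the residual error with the variance of the true values
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Made-up pIC50 values for illustration only
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 5.5, 7.5, 7.5]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(name, round(value, 3))
```

MAE, MSE and RMSE are errors, so lower is better, whereas R² is at most 1 and higher is better. That is why the models are ranked by R² in descending order.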

In [12]:
# Model Performance
from atomsci.ddm.pipeline import compare_models as cm
pred_df = cm.get_filesystem_perf_results(odir, pred_type='regression')



Found data for 2 models under dataset


The pred_df dataframe holds, for each model, the model_uuid, model_path, ampl_version, model_type, features, and splitter, along with its scores on common performance metrics. Let us view the contents of pred_df.

In [13]:
# View the pred_df dataframe
pred_df

Unnamed: 0,model_uuid,model_path,ampl_version,model_type,dataset_key,features,splitter,model_score_type,feature_transform_type,model_choice_score,...,rf_max_depth,max_epochs,best_epoch,learning_rate,layer_sizes,dropouts,xgb_gamma,xgb_learning_rate,model_parameters_dict,feat_parameters_dict
0,d14f3390-a092-4c09-998b-b0aa6455fecc,dataset/curated_kcna5_ic50_model_d14f3390-a092...,1.5.1,RF,/usr/WS1/hiran/AMPL/atomsci/ddm/examples/tutor...,rdkit_raw,scaffold,r2,normalization,0.373714,...,,,,,,,,,"{""rf_estimators"": 500, ""rf_max_depth"": null, ""...",{}
1,75f7f1da-a332-40ae-a7ac-0a8f09a9b84b,dataset/curated_kcna5_ic50_model_75f7f1da-a332...,1.5.1,RF,/usr/WS1/hiran/AMPL/atomsci/ddm/examples/tutor...,rdkit_raw,scaffold,r2,normalization,0.334053,...,,,,,,,,,"{""rf_estimators"": 500, ""rf_max_depth"": null, ""...",{}


# Top Performing Model
To pick the top-performing model, we sort the models by their validation-set R² scores in descending order and take the first row.

In [14]:
# Top performing model
top_model=pred_df.sort_values(by="best_valid_r2_score", ascending=False).iloc[0,:]
top_model

model_uuid                               d14f3390-a092-4c09-998b-b0aa6455fecc
model_path                  dataset/curated_kcna5_ic50_model_d14f3390-a092...
ampl_version                                                            1.5.1
model_type                                                                 RF
dataset_key                 /usr/WS1/hiran/AMPL/atomsci/ddm/examples/tutor...
features                                                            rdkit_raw
splitter                                                             scaffold
model_score_type                                                           r2
feature_transform_type                                          normalization
model_choice_score                                                   0.373714
best_train_r2_score                                                  0.941051
best_train_rms_score                                                 0.198123
best_train_mae_score                                            

# Model tarball
The location of the tarball in which the top-performing model is saved is given by top_model.model_path.

In [15]:
# Top performing model path
top_model.model_path

'dataset/curated_kcna5_ic50_model_d14f3390-a092-4c09-998b-b0aa6455fecc.tar.gz'