# Train a Simple Regression Model

The process of training a machine learning (ML) model can be thought of as fitting a
highly parameterized function to map inputs to outputs. An ML algorithm needs to learn from
numerous examples of input and output pairs to accurately map an input to an output,
i.e., make a prediction. After training, the result is referred to as a trained ML model or an artifact.

This tutorial will detail how we can use [AMPL](https://github.com/ATOMScience-org/AMPL) tools to train a regression model to predict 
how much a compound will inhibit the KCNA3 protein as measured by pIC50. 
We will train a random forest model using the following inputs:

1. The curated kcna3 dataset from **tutorial 2**.
2. The split file generated in **tutorial 3**.
3. [RDKit](https://github.com/rdkit/rdkit) features calculated by the [AMPL](https://github.com/ATOMScience-org/AMPL) pipeline.

We will explain the use of descriptors, how to evaluate model performance,
and where the model is saved as a .tar.gz file.

> **Note** *Training a random forest model and splitting the dataset are non-deterministic. 
You will obtain a slightly different random forest model by running this tutorial each time.*

## Model Training (using already split data)

We will use the curated dataset created in **tutorial 2** and the split file 
created in **tutorial 3** to build a json file for training. We set `"previously_split": "True"`
and set the `split_uuid`. 
Here, we will use `"split_uuid": "8daa5687-c2ee-45e4-b385-36164246c419"`,
the uuid of the scaffold split created in **tutorial 3**.
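Because the training parameters are plain key/value pairs, they can be kept in a JSON file. The following is a minimal sketch, using only the Python standard library, that writes a subset of the parameters from this tutorial to a file and reads them back. The file name `config_kcna3_rf.json` and the helper names `save_config` and `load_config` are hypothetical choices for illustration, not part of AMPL.

```python
import json
import tempfile
from pathlib import Path

# Subset of the training parameters used in this tutorial.
params = {
    "dataset_key": "dataset/curated_kcna3_ic50.csv",
    "id_col": "compound_id",
    "smiles_col": "base_rdkit_smiles",
    "response_cols": "avg_pIC50",
    "previously_split": "True",
    "split_uuid": "8daa5687-c2ee-45e4-b385-36164246c419",
    "featurizer": "computed_descriptors",
    "descriptor_type": "rdkit_raw",
    "model_type": "RF",
    "prediction_type": "regression",
    "result_dir": "dataset",
}


def save_config(config, path):
    """Write the parameter dictionary to a JSON file and return its path."""
    path = Path(path)
    path.write_text(json.dumps(config, indent=4))
    return path


def load_config(path):
    """Read a parameter dictionary back from a JSON file."""
    return json.loads(Path(path).read_text())


# Hypothetical file name, written to a temporary directory for illustration.
config_path = save_config(params, Path(tempfile.mkdtemp()) / "config_kcna3_rf.json")
loaded = load_config(config_path)
print(loaded["previously_split"], loaded["split_uuid"])
```

The loaded dictionary has the same form as the `params` dictionary passed to `parse.wrapper` in this tutorial.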

[AMPL](https://github.com/ATOMScience-org/AMPL) provides an extensive featurization module that can generate a 
variety of molecular feature types, given SMILES strings as input. 
For demonstration purposes, we choose to use RDKit features in this tutorial.

If no featurized dataset has been saved yet for `curated_kcna3_ic50`, 
[AMPL](https://github.com/ATOMScience-org/AMPL) will create one and save it as a csv file in a folder called `scaled_descriptors`, 
e.g. `dataset/scaled_descriptors/curated_kcna3_ic50_with_rdkit_raw_descriptors.csv`.
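To check whether a featurized file is already present, it can help to build the expected path first. The sketch below does this with the standard library. The naming pattern is an assumption generalized from the single example path given in this tutorial, and `expected_descriptor_file` is a hypothetical helper, not an AMPL function.

```python
from pathlib import PurePosixPath


def expected_descriptor_file(dataset_file, descriptor_type):
    """Guess where the featurized CSV would be saved for a dataset.

    Assumption: the file name follows the single example in this tutorial,
    <dataset name>_with_<descriptor_type>_descriptors.csv, inside a
    scaled_descriptors folder next to the dataset file.
    """
    dataset = PurePosixPath(dataset_file)
    file_name = f"{dataset.stem}_with_{descriptor_type}_descriptors.csv"
    return str(dataset.parent / "scaled_descriptors" / file_name)


descriptor_path = expected_descriptor_file("dataset/curated_kcna3_ic50.csv", "rdkit_raw")
print(descriptor_path)
```

Passing the result to `os.path.exists` shows whether the featurized file is already on disk.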

In [1]:
# importing relevant libraries
import pandas as pd
from atomsci.ddm.pipeline import model_pipeline as mp
from atomsci.ddm.pipeline import parameter_parser as parse

# Set up
dataset_file = 'dataset/curated_kcna3_ic50.csv'
odir='dataset'

response_col = "avg_pIC50"
compound_id = "compound_id"
smiles_col = "base_rdkit_smiles"
split_uuid = "8daa5687-c2ee-45e4-b385-36164246c419"

params = {
        "verbose": "True",
        "system": "LC",
        "datastore": "False",
        "save_results": "False",
        "prediction_type": "regression",
        "dataset_key": dataset_file,
        "id_col": compound_id,
        "smiles_col": smiles_col,
        "response_cols": response_col,
        "previously_split": "True",
        "split_uuid" : split_uuid,
        "split_only": "False",
        "featurizer": "computed_descriptors",
        "descriptor_type" : "rdkit_raw",
        "model_type": "RF",
        "transformers": "True",
        "rerun": "False",
        "result_dir": odir
    }

ampl_param = parse.wrapper(params)
pl = mp.ModelPipeline(ampl_param)
pl.train_model()

Skipped loading some Jax models, missing a dependency. No module named 'jax'
  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)


## Model Training (Split data and train)

It is possible to split and train a model in one step. 
Here, we set `"previously_split": "False"` and do not specify a `split_uuid` parameter. 
[AMPL](https://github.com/ATOMScience-org/AMPL) splits the data using the method given in the `splitter` parameter 
(`scaffold` in this example) and writes the split file to
`dataset/curated_kcna3_ic50_train_valid_test_scaffold_{split_uuid}.csv`. 
After training, [AMPL](https://github.com/ATOMScience-org/AMPL) saves the model and all of its parameters as a tarball in `result_dir`.
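Since the `split_uuid` is embedded in the split file name, it can be recovered from the name when it is needed for a later run. The sketch below assumes that split files follow the naming pattern shown above. The regular expression and the helper `parse_split_file_name` are illustrative, not part of AMPL, and the file name used in the example is composed from that pattern and the scaffold split uuid from **tutorial 3**.

```python
import re

# Assumption: split files are named
# <dataset name>_train_valid_test_<splitter>_<split_uuid>.csv
SPLIT_FILE_PATTERN = re.compile(
    r"^(?P<dataset>.+)_train_valid_test_(?P<splitter>[a-z_]+)_"
    r"(?P<split_uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\.csv$"
)


def parse_split_file_name(file_name):
    """Return the dataset name, splitter and split_uuid, or None if no match."""
    match = SPLIT_FILE_PATTERN.match(file_name)
    return match.groupdict() if match else None


info = parse_split_file_name(
    "curated_kcna3_ic50_train_valid_test_scaffold_"
    "8daa5687-c2ee-45e4-b385-36164246c419.csv"
)
print(info)
```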

In [2]:
response_col = "avg_pIC50"
compound_id = "compound_id"
smiles_col = "base_rdkit_smiles"

params = {
        "verbose": "True",
        "system": "LC",
        "datastore": "False",
        "save_results": "False",
        "prediction_type": "regression",
        "dataset_key": dataset_file,
        "id_col": compound_id,
        "smiles_col": smiles_col,
        "response_cols": response_col,
        "previously_split": "False",
        "split_only": "False",
        "splitter": "scaffold",
        "split_valid_frac": "0.15",
        "split_test_frac": "0.15",
        "featurizer": "computed_descriptors",
        "descriptor_type" : "rdkit_raw",
        "model_type": "RF",
        "transformers": "True",
        "rerun": "False",
        "result_dir": odir
    }

ampl_param = parse.wrapper(params)
pl = mp.ModelPipeline(ampl_param)
pl.train_model()

  X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)


## Performance of the Model
We evaluate model performance by measuring how accurate 
model predictions are on validation and test sets. 
The validation set is used while optimizing the model and for choosing the best
parameter settings. Then the performance on the test set is the final judge of
model performance.

AMPL provides several popular metrics to evaluate regression models: 
Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and R² (R-Squared).
In our tutorials, we will use the R² metric to compare our models. The best model will have the highest
R² score.
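To make these four metrics concrete, the sketch below computes them from their standard textbook definitions in plain Python. It is not AMPL's implementation, and the pIC50 values are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R² for paired true and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)              # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Made-up pIC50 values, for illustration only.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 6.0, 6.5, 8.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAE, MSE and RMSE are errors, so lower values are better. R² is at most 1, higher values are better, and it becomes negative when the predictions are worse than always predicting the mean of the true values.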

> **Note** *The model tracker client will not be supported in your environment.*

In [14]:
# Model Performance
from atomsci.ddm.pipeline import compare_models as cm

pred_df = cm.get_filesystem_perf_results(odir, pred_type='regression')

Found data for 20 models under dataset


The pred_df dataframe has details about the model_uuid, model_path, ampl_version, model_type, features, splitter and the results for popular metrics that help evaluate the performance. Let us view the contents of the pred_df dataframe.

In [15]:
pred_df.to_csv('./dataset/pred_df.csv')

In [20]:
# View the pred_df dataframe
pred_df.head()

| | model_uuid | model_path | ampl_version | model_type | dataset_key | features | splitter | split_strategy | split_uuid | model_score_type | ... | dropouts | xgb_gamma | xgb_learning_rate | xgb_max_depth | xgb_colsample_bytree | xgb_subsample | xgb_n_estimators | xgb_min_child_weight | model_parameters_dict | feat_parameters_dict |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 7cb431f6-8ef8-4aa3-8fac-12aa5b46878e | dataset/curated_kcna3_ic50_model_7cb431f6-8ef8... | 1.6.0 | RF | /home/apaulson/repos/AMPL_umbrella/AMPL/atomsc... | rdkit_raw | scaffold | train_valid_test | ab4e8dd3-44f5-4bfe-9d2a-a0ddf43dc653 | r2 | ... | | | | | | | | | {"rf_estimators": 500, "rf_max_depth": null, "... | {} |
| 8 | be495392-cab1-4ed8-a233-c944d051e3c9 | dataset/curated_kcna3_ic50_model_be495392-cab1... | 1.6.0 | RF | /home/apaulson/repos/AMPL_umbrella/AMPL/atomsc... | rdkit_raw | scaffold | train_valid_test | 8daa5687-c2ee-45e4-b385-36164246c419 | r2 | ... | | | | | | | | | {"rf_estimators": 500, "rf_max_depth": null, "... | {} |
| 12 | 03d2f178-2aee-4b8b-8601-97a00fb42624 | dataset/curated_kcna3_ic50_model_03d2f178-2aee... | 1.6.0 | RF | /home/apaulson/repos/AMPL_umbrella/AMPL/atomsc... | rdkit_raw | scaffold | train_valid_test | 8daa5687-c2ee-45e4-b385-36164246c419 | r2 | ... | | | | | | | | | {"rf_estimators": 500, "rf_max_depth": null, "... | {} |
| 9 | ac023bc7-c7b3-406b-a830-b71ed2615a7e | dataset/curated_kcna3_ic50_model_ac023bc7-c7b3... | 1.6.0 | RF | /home/apaulson/repos/AMPL_umbrella/AMPL/atomsc... | rdkit_raw | scaffold | train_valid_test | 088354c9-91d4-4e30-a80b-6798699fde91 | r2 | ... | | | | | | | | | {"rf_estimators": 500, "rf_max_depth": null, "... | {} |
| 18 | d9436b51-5603-4dc0-a7c2-704e243cbd0b | dataset/curated_kcna3_ic50_model_d9436b51-5603... | 1.6.0 | RF | /home/apaulson/repos/AMPL_umbrella/AMPL/atomsc... | rdkit_raw | scaffold | train_valid_test | 8daa5687-c2ee-45e4-b385-36164246c419 | r2 | ... | | | | | | | | | {"rf_estimators": 500, "rf_max_depth": null, "... | {} |


In [21]:
pred_df[['model_uuid', 'best_valid_r2_score', 'best_test_r2_score', 'best_train_num_compounds']]

| | model_uuid | best_valid_r2_score | best_test_r2_score | best_train_num_compounds |
|---|---|---|---|---|
| 5 | 7cb431f6-8ef8-4aa3-8fac-12aa5b46878e | 0.377042 | 0.26154 | 259 |
| 8 | be495392-cab1-4ed8-a233-c944d051e3c9 | 0.376573 | 0.263537 | 259 |
| 12 | 03d2f178-2aee-4b8b-8601-97a00fb42624 | 0.373521 | 0.26308 | 259 |
| 9 | ac023bc7-c7b3-406b-a830-b71ed2615a7e | 0.373251 | 0.272772 | 259 |
| 18 | d9436b51-5603-4dc0-a7c2-704e243cbd0b | 0.36822 | 0.272563 | 259 |
| 16 | 77969ded-6ef6-42f7-8352-bfc38e6951cc | 0.368073 | 0.269387 | 259 |
| 17 | 24e9b61a-73ae-4d0e-9471-4ecae9a3916b | 0.36677 | 0.27897 | 259 |
| 19 | d589e133-07e1-4ca0-a7bb-bc49b7209a24 | 0.364822 | 0.288209 | 259 |
| 11 | f0f50ebd-6ec0-452b-91f6-94701258d60f | 0.364106 | 0.26708 | 259 |
| 1 | 5e966bff-9ec4-4c11-a02b-e2568cec8cff | 0.363474 | 0.282021 | 259 |


## Top Performing Model
To pick the top performing model, we sort the models by `best_valid_r2_score` in descending order and take the first row, i.e. the model with the highest validation R² score.

In [22]:
# Top performing model
top_model=pred_df.sort_values(by="best_valid_r2_score", ascending=False).iloc[0,:]
top_model

model_uuid                               7cb431f6-8ef8-4aa3-8fac-12aa5b46878e
model_path                  dataset/curated_kcna3_ic50_model_7cb431f6-8ef8...
ampl_version                                                            1.6.0
model_type                                                                 RF
dataset_key                 /home/apaulson/repos/AMPL_umbrella/AMPL/atomsc...
features                                                            rdkit_raw
splitter                                                             scaffold
split_strategy                                               train_valid_test
split_uuid                               ab4e8dd3-44f5-4bfe-9d2a-a0ddf43dc653
model_score_type                                                           r2
feature_transform_type                                          normalization
model_choice_score                                                   0.377042
best_train_r2_score                                             
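The selection rule used here can also be written without pandas. The sketch below applies it to the first five rows of the score table shown earlier in this tutorial. The helper `select_by_validation` is a hypothetical name for illustration.

```python
# (model_uuid, best_valid_r2_score, best_test_r2_score) for the first
# five rows of the score table shown earlier in this tutorial.
results = [
    ("7cb431f6-8ef8-4aa3-8fac-12aa5b46878e", 0.377042, 0.26154),
    ("be495392-cab1-4ed8-a233-c944d051e3c9", 0.376573, 0.263537),
    ("03d2f178-2aee-4b8b-8601-97a00fb42624", 0.373521, 0.26308),
    ("ac023bc7-c7b3-406b-a830-b71ed2615a7e", 0.373251, 0.272772),
    ("d9436b51-5603-4dc0-a7c2-704e243cbd0b", 0.36822, 0.272563),
]


def select_by_validation(rows):
    """Return the row with the highest validation R²; the test R² is not consulted."""
    return max(rows, key=lambda row: row[1])


best_uuid, best_valid_r2, best_test_r2 = select_by_validation(results)
print(best_uuid, best_valid_r2, best_test_r2)
```

The selected model does not have the highest test R² in this list. That is expected: choosing by test score would make the test set part of model selection and bias the final performance estimate.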

## Model Tarball 
The location of the tarball in which the top performing model is saved is stored in `top_model.model_path`.

In [8]:
# Top performing model path
top_model.model_path

'dataset/curated_kcna3_ic50_model_7cb431f6-8ef8-4aa3-8fac-12aa5b46878e.tar.gz'
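Because the model is stored as an ordinary gzip-compressed tar archive, its contents can be inspected with Python's standard `tarfile` module. The sketch below only lists the member names and makes no assumption about which files AMPL places in the archive. `list_model_tarball` is a hypothetical helper name.

```python
import os
import tarfile


def list_model_tarball(tar_path):
    """Return the names of all members stored in a .tar.gz archive."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return tar.getnames()


# Path of the top performing model from this tutorial; the listing is
# skipped if the file is not present in the current working directory.
model_tarball = "dataset/curated_kcna3_ic50_model_7cb431f6-8ef8-4aa3-8fac-12aa5b46878e.tar.gz"
if os.path.exists(model_tarball):
    for name in list_model_tarball(model_tarball):
        print(name)
```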

In [9]:
# Top performing model split_uuid
top_model.split_uuid

'ab4e8dd3-44f5-4bfe-9d2a-a0ddf43dc653'

We will need this path in the next tutorial in which we use the trained model to make predictions on a new dataset.