# Perform a Split

One of the goals of machine learning is to build a model that can generalize and perform well on new data. 
The problem is that you may not have new data, but you can simulate this experience by splitting the dataset into train, validation and test sets. 

**Training set**: A subset of the main dataset that is fed into the model so that the model can learn the data patterns.

**Validation set**: This set is used to compare the performance of different models and hyperparameter choices.

**Test set**: This set is held out until the end and is used to measure the final model's performance.
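
As a rough illustration of the three subsets, a random split can be sketched in plain Python. The function name `random_split`, the compound identifiers, and the seed are illustrative assumptions; the 15% fractions mirror the `split_valid_frac` and `split_test_frac` values used later in this tutorial. This is a minimal sketch, not how AMPL implements its splitters.

```python
import random

def random_split(ids, valid_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle ids and cut them into train/valid/test lists (illustrative only)."""
    ids = list(ids)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rng.shuffle(ids)
    n_valid = int(round(len(ids) * valid_frac))
    n_test = int(round(len(ids) * test_frac))
    valid = ids[:n_valid]
    test = ids[n_valid:n_valid + n_test]
    train = ids[n_valid + n_test:]
    return train, valid, test

# Hypothetical compound identifiers, used only for this sketch
compound_ids = [f"cmpd_{i}" for i in range(100)]
train, valid, test = random_split(compound_ids)
print(len(train), len(valid), len(test))  # 70 15 15
```

Every compound lands in exactly one subset, so the model never sees validation or test compounds during training.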

![](img/03_split_explanation.png)

Performing a split lets the model validation process simulate how your model will perform on new data. This tutorial covers some of the tools in AMPL for performing a split:

- Scaffold split
- Rationale for choosing between scaffold and random splits
- The split file format
- Split only
- Building the JSON config

In [1]:
'''We will use the curated dataset that we created in tutorial 2 
and learn how to split it into train, validation and test sets.'''

import pandas as pd

# Set up
dataset_file = 'dataset/curated_kcna5_ic50.csv'
odir='dataset'

Machine learning (ML) models learn the relationship between molecules and molecular properties.
These models can dramatically accelerate the screening process by giving researchers information on which molecules are most and least likely to have the desired properties.

ML models, however, are only as good as the data they were trained on. 
In the chemical space, this problem manifests itself when a model is queried with a molecule from an unfamiliar region of chemical space. 
If a model is trained only on molecules that belong to only a handful of scaffold classes, its ability to predict a molecule with an unfamiliar scaffold is unknown.

# Perform a Scaffold split

Scaffold split is based on the scaffold of the molecules. 
This ensures that the train/validation/test sets are structurally different. 
It is more challenging than a random split where the data is split into train/validation/test at random.
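
The core idea can be sketched without any chemistry toolkit: group compounds by scaffold, then assign whole groups to a subset so that no scaffold appears in more than one subset. The function `scaffold_group_split`, the scaffold labels, and the compound IDs below are hypothetical, and the largest-groups-first ordering is one common heuristic, not necessarily what AMPL does internally.

```python
from collections import defaultdict

def scaffold_group_split(scaffold_by_id, valid_frac=0.15, test_frac=0.15):
    """Assign whole scaffold groups to train/valid/test so no scaffold is shared."""
    groups = defaultdict(list)
    for cmpd_id, scaffold in scaffold_by_id.items():
        groups[scaffold].append(cmpd_id)

    # Largest groups first: common scaffolds fill train, rarer ones go to valid/test
    ordered = sorted(groups.values(), key=len, reverse=True)

    n_total = len(scaffold_by_id)
    train_cutoff = (1.0 - valid_frac - test_frac) * n_total
    valid_cutoff = (1.0 - test_frac) * n_total

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Hypothetical scaffold labels; in practice these would be derived from the SMILES strings
scaffold_by_id = {}
for scaffold, count in [("scaffold_A", 9), ("scaffold_B", 4), ("scaffold_C", 3),
                        ("scaffold_D", 2), ("scaffold_E", 2)]:
    for i in range(count):
        scaffold_by_id[f"{scaffold}_cmpd_{i}"] = scaffold

train, valid, test = scaffold_group_split(scaffold_by_id)
print(len(train), len(valid), len(test))
```

Because groups are never broken up, the subset sizes only approximate the requested fractions, and the validation and test sets end up holding the less common scaffolds.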


In [2]:
split_type=['scaffold','random']
param_lst=[]
for split_val in split_type :
    params = {
        "verbose": "True",
        "system": "LC",

        # dataset info
        "dataset_key" : dataset_file,
        "datastore": "False",
        "response_cols" : "avg_pIC50",
        "id_col": "compound_id",
        "smiles_col" : "base_rdkit_smiles",
        "result_dir": odir,

        # splitting
        "split_only": "True",
        "previously_split": "False",
        "splitter": split_val,
        "split_valid_frac": "0.15",
        "split_test_frac": "0.15",

        # featurization & training params
        "featurizer": "ecfp",
    }
    param_lst.append(params)

The dataset split table is saved as a .csv file in the same directory as the `dataset_key` file. 
The name of the split file starts with the base name of the `dataset_key`, followed by the split strategy (`train_valid_test`), 
the split type (scaffold/random), and the `split_uuid`, a unique identifier of the split.

In [3]:
from atomsci.ddm.pipeline import model_pipeline as mp
from atomsci.ddm.pipeline import parameter_parser as parse

split_lst=[]
for params in param_lst :
    pparams = parse.wrapper(params)
    MP = mp.ModelPipeline(pparams)
    split_uuid = MP.split_dataset()
    split_lst.append((params,split_uuid))

  from .autonotebook import tqdm as notebook_tqdm
Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'torch_geometric'
Skipped loading modules with pytorch-geometric dependency, missing a dependency. cannot import name 'DMPNN' from 'deepchem.models.torch_models' (/usr/WS1/he6/AMPL_virtualenv_1.6/lib/python3.9/site-packages/deepchem/models/torch_models/__init__.py)
Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'pytorch_lightning'
Skipped loading some Jax models, missing a dependency. jax requires jaxlib to be installed. See https://github.com/google/jax#installation for installation instructions.
INFO:ATOM:Created a dataset hash 'd9d4152aaea8543c0d1eaaec0362ec7a' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna5_ic50.csv'
DEBUG:ATOM:Attempting to load featurized dataset
DEBUG:ATOM:Exception when trying to load featurized data:
DynamicFe

# Rationale for choosing between a scaffold split and a random split

A generalizable model will be able to accurately predict the properties of molecules it has never seen before, 
reducing the need to perform extensive manual assays each time a new chemical class is to be tested. 
Generalizable models can predict across multiple different scaffolds and molecule types, while a non-generalizable model cannot.

When the dataset is split using a scaffold split, the test set is structurally different from the training set, which gives a better understanding of model generalizability. When using a random split, there is no guarantee that the test set will be structurally different from the training set.

In [11]:
# display the split file location and names
import os
file_lst=[]
for params, sid in split_lst :
    fname=params['dataset_key']
    dirname=os.path.dirname(fname)
    split_val=params['splitter']

    # find the file that contains the correct uuid
    all_files = os.listdir(dirname)
    for file in all_files:
        if sid in file:
            nfile = os.path.join(dirname, file)
            file_lst.append((nfile,sid,split_val))
            break
print(file_lst)

[('dataset/curated_kcna5_ic50_train_valid_test_scaffold_a5a9a3c5-26a0-4083-9a22-2ec94c182882.csv', 'a5a9a3c5-26a0-4083-9a22-2ec94c182882', 'scaffold'), ('dataset/curated_kcna5_ic50_train_valid_test_random_3199e1ce-95e5-4f1d-9c4d-8ddc742967ac.csv', '3199e1ce-95e5-4f1d-9c4d-8ddc742967ac', 'random')]
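
The file names printed above can also be taken apart programmatically. The sketch below assumes the naming pattern seen in this output (`<dataset>_train_valid_test_<splitter>_<uuid>.csv`); the helper `parse_split_filename` and the regular expression are illustrative, not part of AMPL, and other split strategies may use a different pattern.

```python
import os
import re

# Pattern inferred from the file names shown above (an assumption, not an AMPL API)
SPLIT_FILE_PATTERN = re.compile(
    r"^(?P<dataset>.+)_(?P<strategy>train_valid_test)_(?P<splitter>[a-z_]+)_"
    r"(?P<uuid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\.csv$"
)

def parse_split_filename(path):
    """Return the dataset name, split strategy, splitter and split_uuid from a path."""
    match = SPLIT_FILE_PATTERN.match(os.path.basename(path))
    if match is None:
        raise ValueError(f"Not a recognized split file name: {path}")
    return match.groupdict()

info = parse_split_filename(
    "dataset/curated_kcna5_ic50_train_valid_test_scaffold_"
    "a5a9a3c5-26a0-4083-9a22-2ec94c182882.csv"
)
print(info["dataset"], info["splitter"], info["uuid"])
```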


# Format of the split file
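The split file consists of three columns: `cmpd_id` is the compound ID, `subset` tells you whether the compound is in the train, validation, or test set, and `fold` tells you which fold the compound belongs to. The `Unnamed: 0` column in the output below is the row index that was saved with the file.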

In [12]:
# Explore contents of the split file
file,sid,split_val = file_lst[0]
df=pd.read_csv(file)
df.head(3)

Unnamed: 0,cmpd_id,subset,fold
0,CHEMBL408935,train,0
1,CHEMBL1289299,train,0
2,CHEMBL3262803,train,0
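
A quick way to sanity-check a split file is to count how many compounds fall into each subset. The sketch below uses only the standard library on an inline copy of the three rows shown above, plus two rows with made-up IDs; the `valid` and `test` subset labels are assumptions, since only `train` rows appear in the output. With pandas, `df['subset'].value_counts()` would give the same counts.

```python
import csv
import io
from collections import Counter

# First three rows are copied from the output above; the last two are hypothetical
split_csv = io.StringIO(
    ",cmpd_id,subset,fold\n"
    "0,CHEMBL408935,train,0\n"
    "1,CHEMBL1289299,train,0\n"
    "2,CHEMBL3262803,train,0\n"
    "3,HYPOTHETICAL_ID_1,valid,0\n"
    "4,HYPOTHETICAL_ID_2,test,0\n"
)

# Map each compound ID to its subset, then count compounds per subset
subset_by_id = {row["cmpd_id"]: row["subset"] for row in csv.DictReader(split_csv)}
subset_counts = Counter(subset_by_id.values())
print(subset_counts)
```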


### Compare Tanimoto distances between training and test sets for random and scaffold splits

In [13]:
import atomsci.ddm.utils.compare_splits_plots as csp

# make a SplitStats object
# call dist_hist_plot
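
The cell above leaves the `SplitStats` and `dist_hist_plot` calls as placeholders. To make the quantity being compared concrete, here is a minimal pure-Python sketch of the Tanimoto distance between fingerprints, and of the distance from each test compound to its nearest training compound. It does not use the AMPL API; the function names and the toy fingerprints (sets of on-bit indices) are hypothetical.

```python
def tanimoto_distance(bits_a, bits_b):
    """Return 1 - |A & B| / |A | B| for two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(bits_a & bits_b) / union

def nearest_train_distances(test_fps, train_fps):
    """For each test fingerprint, the distance to its closest training fingerprint."""
    return [min(tanimoto_distance(t, tr) for tr in train_fps) for t in test_fps]

# Hypothetical fingerprints, for illustration only
train_fps = [{1, 2, 3, 4}, {2, 3, 5}]
test_fps = [{1, 2, 3}, {7, 8, 9}]

distances = nearest_train_distances(test_fps, train_fps)
print(distances)  # [0.25, 1.0]
```

A histogram of these nearest-neighbor distances would be expected to sit closer to 1 for a scaffold split than for a random split, because scaffold-split test compounds share fewer substructures with the training set.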

# Building a json file and the split_only flag

We can build a split config file and save it as a json file as shown below. The split_only flag can be set to True if you want to only split the dataset and not proceed with training. 


In [14]:
# Build the json
# an example of how to build a json using a split config file is shown below

import json

split_config = {
"dataset_key" : dataset_file,
"datastore": "False",
"split_only": "True",
"splitter": "scaffold",
"split_valid_frac": "0.15",
"split_test_frac": "0.15",
"previously_split": "False",
"response_cols" : "avg_pIC50",
"id_col": "compound_id",
"smiles_col" : "base_rdkit_smiles",
"result_dir": odir,
"featurizer": "ecfp",
"system": "LC",
"verbose": "True"
}

config_path = "dataset/kcna5_chembl_scaffold_nn_split_config1.json"
with open(config_path, 'w') as out:
    json.dump(split_config, out, sort_keys=False, indent=4, separators=(',', ': '))
    print('Wrote %s' % config_path)

INFO:ATOM:Created a dataset hash 'd9d4152aaea8543c0d1eaaec0362ec7a' from dataset_key '/usr/WS1/he6/code/ATOM/AMPL1.6/atomsci/ddm/examples/tutorials2023/dataset/curated_kcna5_ic50.csv'


DEBUG:ATOM:Attempting to load featurized dataset
DEBUG:ATOM:Exception when trying to load featurized data:
DynamicFeaturization doesn't support get_featurized_dset_name()
INFO:ATOM:Featurized dataset not previously saved for dataset curated_kcna5_ic50, creating new
INFO:ATOM:Featurizing sample 0
DEBUG:ATOM:Number of features: 1024


Wrote dataset/kcna5_chembl_scaffold_nn_split_config1.json


This can be passed to `model_pipeline.py` like so:
```
python atomsci/ddm/pipeline/model_pipeline.py --config dataset/kcna5_chembl_scaffold_nn_split_config1.json
```
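
Before running the pipeline, it can be worth reading the config back and checking it. The sketch below is an assumption-laden example: `check_split_config` and `REQUIRED_KEYS` are hypothetical names, and the key list is simply the split-related keys used in this tutorial, not a list of what AMPL itself requires. The demo writes a small config to a temporary directory so that it runs on its own.

```python
import json
import os
import tempfile

# Split-related keys used in this tutorial (an illustrative choice, not an AMPL requirement)
REQUIRED_KEYS = {"dataset_key", "splitter", "split_only",
                 "split_valid_frac", "split_test_frac"}

def check_split_config(path):
    """Load a JSON config and check the split-related keys and fractions."""
    with open(path) as handle:
        config = json.load(handle)
    missing = sorted(REQUIRED_KEYS - config.keys())
    if missing:
        raise KeyError(f"Missing keys in {path}: {missing}")
    held_out = float(config["split_valid_frac"]) + float(config["split_test_frac"])
    if not 0.0 < held_out < 1.0:
        raise ValueError(f"Validation + test fractions must be in (0, 1), got {held_out}")
    return config

# Demo with a temporary file holding the same split settings as above
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "split_config.json")
    with open(demo_path, "w") as out:
        json.dump({
            "dataset_key": "dataset/curated_kcna5_ic50.csv",
            "splitter": "scaffold",
            "split_only": "True",
            "split_valid_frac": "0.15",
            "split_test_frac": "0.15",
        }, out, indent=4)
    checked = check_split_config(demo_path)

print(checked["splitter"], checked["split_only"])
```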