<a href="https://colab.research.google.com/github/ChemistZee/ml_for_molecules/blob/main/Dataset_splitting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Splitting the dataset

Once the dataset is cleaned up, we can create the train, validation and test splits.

There are libraries available to split the dataset based on the output value, molecular weight, scaffold, etc. This approach requires converting the CSV file to a library-specific format, which is sometimes cumbersome.

For simplicity, we will first randomly split the dataset. We will use the QM9 dataset with ```gap``` as the output (target).

In [1]:
# import pandas library
import pandas as pd

# load the dataframe as CSV from URL.
# If you upload the file to Colab, replace the URL with the file name
df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/qm9.csv")

# look at the top 5 entries
df.head()

Unnamed: 0,mol_id,smiles,A,B,C,mu,alpha,homo,lumo,gap,...,zpve,u0,u298,h298,g298,cv,u0_atom,u298_atom,h298_atom,g298_atom
0,gdb_1,C,157.7118,157.70997,157.70699,0.0,13.21,-0.3877,0.1171,0.5048,...,0.044749,-40.47893,-40.476062,-40.475117,-40.498597,6.469,-395.999595,-398.64329,-401.014647,-372.471772
1,gdb_2,N,293.60975,293.54111,191.39397,1.6256,9.46,-0.257,0.0829,0.3399,...,0.034358,-56.525887,-56.523026,-56.522082,-56.544961,6.316,-276.861363,-278.620271,-280.399259,-259.338802
2,gdb_3,O,799.58812,437.90386,282.94545,1.8511,6.31,-0.2928,0.0687,0.3615,...,0.021375,-76.404702,-76.401867,-76.400922,-76.422349,6.002,-213.087624,-213.974294,-215.159658,-201.407171
3,gdb_4,C#C,0.0,35.610036,35.610036,0.0,16.28,-0.2845,0.0506,0.3351,...,0.026841,-77.308427,-77.305527,-77.304583,-77.327429,8.574,-385.501997,-387.237686,-389.016047,-365.800724
4,gdb_5,C#N,0.0,44.593883,44.593883,2.8937,12.99,-0.3604,0.0191,0.3796,...,0.016601,-93.411888,-93.40937,-93.408425,-93.431246,6.278,-301.820534,-302.906752,-304.091489,-288.720028


The [Fast-ML](https://pypi.org/project/fast-ml/) package has built-in functionality for analyzing datasets, but it is not chemistry-aware. Since we are splitting the dataset randomly, we can use this package.

In [10]:
# install Fast-ML
! pip install fast_ml



In [13]:
# import the function to split into train-valid-test
from fast_ml.model_development import train_valid_test_split

In [5]:
# we will split the dataset as train-valid-test = 0.8:0.1:0.1
X_train, y_train, X_valid, y_valid, \
X_test, y_test = train_valid_test_split(df[["smiles","gap"]], target = "gap", train_size=0.8,
                                        valid_size=0.1, test_size=0.1)
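
To make the 0.8:0.1:0.1 random split concrete, here is a minimal standard-library sketch of the idea: shuffle the row indices with a fixed seed, then cut the shuffled order at the 80% and 90% marks. The function name `random_split_indices` and its `seed` argument are hypothetical and not part of Fast-ML, whose internals may differ.

```python
import random


def random_split_indices(n_rows, frac_train=0.8, frac_valid=0.1, seed=0):
    """Return (train, valid, test) lists of row indices from a seeded shuffle."""
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    train_end = int(frac_train * n_rows)
    valid_end = int((frac_train + frac_valid) * n_rows)
    return indices[:train_end], indices[train_end:valid_end], indices[valid_end:]


train_idx, valid_idx, test_idx = random_split_indices(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

With a pandas dataframe, something like `df.iloc[train_idx]` would then select the training rows. Fixing the seed makes the split repeatable across notebook runs.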


In [6]:
X_test

Unnamed: 0,smiles
27644,Cc1nc(c(o1)N)C#C
11378,COC1=NCC(C)N1
5248,CCCc1cc[nH]c1
103226,CC1(CO)C2C3CC3C12
57599,CC(O)(C=O)C1CCC1
...,...
102883,CCC(CCC#N)C#N
33230,CCCCc1cocn1
98353,CCC(=O)C(C)OC=O
42925,O=C1C(CC#C)C2CN12


In [7]:
y_test

Unnamed: 0,gap
27644,0.2032
11378,0.2774
5248,0.2502
103226,0.2712
57599,0.2224
...,...
102883,0.3360
33230,0.2455
98353,0.2295
42925,0.2303


For more chemistry-aware dataset splitting, packages like [deepchem](https://deepchem.readthedocs.io/en/latest/index.html) can be used. However, the CSV data must first be converted into a deepchem dataset object before the splitting can be performed.

Let's try splitting the dataset based on molecular weight in deepchem.

In [14]:
# install deepchem
!pip install deepchem

Collecting deepchem
  Downloading deepchem-2.5.0-py3-none-any.whl.metadata (1.1 kB)
Downloading deepchem-2.5.0-py3-none-any.whl (552 kB)
Installing collected packages: deepchem
Successfully installed deepchem-2.5.0


In [15]:
import deepchem as dc

Instructions for updating:
experimental_relax_shapes is deprecated, use reduce_retracing instead


As the kernel restarted, we will reload the QM9 dataset.

In [16]:
# import the pandas library
import pandas as pd

# load the dataframe as CSV from URL.
# If you upload the file to Colab, replace the URL with the file name
df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/qm9.csv")

We will use the ``smiles`` and ``gap`` columns from the dataset as before and create a ``NumpyDataset`` object in deepchem. The documentation for datasets in deepchem can be found [here](https://deepchem.readthedocs.io/en/latest/api_reference/data.html#datasets).

In [17]:
# create the deepchem dataset object
# note: the ids argument is necessary for splitting,
# because the splitter reads the SMILES strings from dataset.ids
dataset = dc.data.NumpyDataset.from_dataframe(df[["smiles","gap"]],
                                              X="smiles",y="gap", ids="smiles")

One can look at the ``X`` and ``y`` values to ensure that the dataset was loaded properly.

In [20]:
dataset.y

array([[0.5048],
       [0.3399],
       [0.3615],
       ...,
       [0.2953],
       [0.3003],
       [0.3058]])

In [23]:
dataset.X

array([['C'],
       ['N'],
       ['O'],
       ...,
       ['C1N2C3C4C5C2C13CN45'],
       ['C1N2C3C4C5CC13C2C45'],
       ['C1N2C3C4C5OC13C2C45']], dtype=object)

We will perform a molecular-weight-based split. More documentation on the splitting methods in deepchem can be found [here](https://deepchem.readthedocs.io/en/latest/api_reference/splitters.html).

In [29]:
# the splitter needs rdkit; uncomment the next line if it is not installed
#!pip install rdkit
from rdkit import Chem
from rdkit.Chem import AllChem
# create the molecular weight splitter object
molecularweightsplitter = dc.splits.MolecularWeightSplitter()

train_dataset, valid_dataset, test_dataset \
 = molecularweightsplitter.train_valid_test_split(
    dataset=dataset, frac_train = 0.8, frac_valid = 0.1,
    frac_test = 0.1)
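
As a rough illustration of what a molecular-weight split does, the sketch below sorts molecules by weight and slices the sorted order, so the lightest molecules land in the training set and the heaviest in the test set. This ordering is our understanding of deepchem's splitter and should be checked against its documentation. The function `weight_sorted_split` and the toy data are hypothetical, and the approximate molecular weights are typed in by hand rather than computed with RDKit.

```python
def weight_sorted_split(weights, frac_train=0.8, frac_valid=0.1):
    """Split indices by ascending weight: lightest -> train, heaviest -> test."""
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    train_end = int(frac_train * len(order))
    valid_end = int((frac_train + frac_valid) * len(order))
    return order[:train_end], order[train_end:valid_end], order[valid_end:]


# Toy data: SMILES with approximate molecular weights in g/mol (hand-entered).
molecules = [
    ("CCO", 46.07), ("C", 16.04), ("O", 18.02), ("CC(=O)O", 60.05), ("N", 17.03),
    ("c1ccccc1", 78.11), ("C#N", 27.03), ("C#C", 26.04), ("CO", 32.04), ("CC", 30.07),
]
weights = [w for _, w in molecules]
train_idx, valid_idx, test_idx = weight_sorted_split(weights)
print([molecules[i][0] for i in test_idx])
```

A split like this tests how well a model extrapolates to molecules heavier than any it saw during training, which is usually a harder task than a random split.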

We can convert the dataset objects back to pandas dataframe with ``to_dataframe`` for easy analysis, if needed.

In [30]:
# convert each split from a deepchem dataset to a pandas dataframe
train_dataset = train_dataset.to_dataframe()
valid_dataset = valid_dataset.to_dataframe()
test_dataset = test_dataset.to_dataframe()
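
Once the splits are available as tables, a quick sanity check is that no molecule appears in more than one split. The helper below is a hypothetical standard-library sketch that works on plain lists of SMILES strings. With the dataframes from this notebook, one could presumably pass the ``ids`` column of each split instead.

```python
from itertools import combinations


def find_overlaps(splits):
    """Return {(name_a, name_b): shared_ids} for every pair of splits that overlap."""
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Toy example with a deliberate leak: "CCO" is in both train and test.
splits = {
    "train": ["C", "CC", "CCO"],
    "valid": ["CO"],
    "test": ["CCO", "C#N"],
}
print(find_overlaps(splits))
```

An empty dictionary means the splits are disjoint. Any entry points to molecules that leak between splits and would inflate the test score.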

In [31]:
test_dataset

Unnamed: 0,X,y,ids
0,CN=C1COC(C)CO1,0.2608,CN=C1COC(C)CO1
1,CCC1CC1([NH3+])C([O-])=O,0.2692,CCC1CC1([NH3+])C([O-])=O
2,CCC1OC1(C)C(N)=O,0.2639,CCC1OC1(C)C(N)=O
3,C[NH2+]C1CC1(C)C([O-])=O,0.1473,C[NH2+]C1CC1(C)C([O-])=O
4,COC1CC1(C)C(N)=O,0.2684,COC1CC1(C)C(N)=O
...,...,...,...
13384,OCC(O)CC(F)(F)F,0.3228,OCC(O)CC(F)(F)F
13385,C(C(CO)C(F)(F)F)O,0.3501,C(C(CO)C(F)(F)F)O
13386,OCCC(O)C(F)(F)F,0.3406,OCCC(O)C(F)(F)F
13387,CC(O)C(O)C(F)(F)F,0.3266,CC(O)C(O)C(F)(F)F
