### Split data into training and test sets using the SIMPD algorithm, which approximates time-based splits

In [49]:
import pandas as pd
from ga_lib_3 import run_GA_SIMPD
import seaborn as sns
import datamol as dm
import numpy as np
import json

Read the Biogen solubility data

In [2]:
df = pd.read_csv("biogen_solubility.csv")

The SIMPD method is set up for classification models, so we'll define logS > -4 as "active" (soluble) and logS <= -4 as "inactive" (insoluble).

In [34]:
df["bin_sol"] = np.where(df.logS > -4, "active", "inactive")
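A quick way to confirm that the threshold behaves as intended is to apply the same rule to a few hand-picked values. The sketch below uses made-up logS values (`toy_logS` is illustrative, not taken from the Biogen data) and only checks where the boundary value -4 ends up.

```python
import numpy as np

# Hypothetical logS values chosen to straddle the -4 threshold
toy_logS = np.array([-5.2, -4.0, -3.9, -2.1])

# Same rule as above: strictly greater than -4 is "active"
toy_labels = np.where(toy_logS > -4, "active", "inactive").tolist()

# Count the members of each class to check the balance
n_active = toy_labels.count("active")
n_inactive = toy_labels.count("inactive")
print(toy_labels, n_active, n_inactive)
```

Note that a value of exactly -4 is labeled "inactive" because the comparison is strict.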

In [35]:
simpd_res = run_GA_SIMPD(df,smilesCol="SMILES",actCol="bin_sol",return_random_result=False)

n_gen |  n_eval |   cv (min)   |   cv (avg)   |  n_nds  |     eps      |  indicator  
    1 |     500 |  0.00000E+00 |  0.00000E+00 |      93 |            - |            -
    2 |    1000 |  0.00000E+00 |  0.00000E+00 |      95 |  1.000000000 |        nadir
    3 |    1500 |  0.00000E+00 |  0.00000E+00 |      99 |  0.910868670 |        nadir
    4 |    2000 |  0.00000E+00 |  0.00000E+00 |     101 |  0.083052118 |        nadir
    5 |    2500 |  0.00000E+00 |  0.00000E+00 |     103 |  0.026642336 |        nadir
    6 |    3000 |  0.00000E+00 |  0.00000E+00 |     106 |  0.014565726 |        nadir
    7 |    3500 |  0.00000E+00 |  0.00000E+00 |     108 |  0.003623751 |            f
    8 |    4000 |  0.00000E+00 |  0.00000E+00 |     112 |  0.014705910 |            f
    9 |    4500 |  0.00000E+00 |  0.00000E+00 |     116 |  0.107922535 |        nadir
   10 |    5000 |  0.00000E+00 |  0.00000E+00 |     122 |  0.149827870 |        nadir
   11 |    5500 |  0.00000E+00 |  0.00000E+00 |     12

The element at index 2 of the value returned by **run_GA_SIMPD** holds the splits. Its `X` attribute is used below as a set of boolean masks, one per solution.

In [43]:
res = simpd_res[2]

In [59]:
tests_inds = []
train_inds = []
# Each res.X[i] is a boolean mask over the molecules: True = test set, False = training set
for tmp_sol in range(len(res.F)):
    mask = res.X[tmp_sol]
    tests_inds.append(np.flatnonzero(mask).tolist())
    train_inds.append(np.flatnonzero(~mask).tolist())
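To make the mask-to-index conversion concrete, here is a minimal sketch on a tiny made-up array. It assumes, as the loop above implies, that each row is a boolean mask in which `True` marks a test-set molecule; `toy_X` is a hypothetical stand-in for `res.X`.

```python
import numpy as np

# Hypothetical stand-in for res.X: two candidate splits over five molecules
toy_X = np.array([
    [True, False, False, True, False],
    [False, False, True, False, False],
])

# Positions where the mask is True go to the test set
toy_test = [np.flatnonzero(mask).tolist() for mask in toy_X]
# Positions where the mask is False go to the training set
toy_train = [np.flatnonzero(~mask).tolist() for mask in toy_X]

print(toy_test)
print(toy_train)
```

For a boolean mask, `np.flatnonzero(mask)` returns the same positions as `np.arange(len(mask))[mask]`.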

Create a dictionary with the indices of the training and test set molecules

In [60]:
out_dict = {"train" : train_inds, "test" : tests_inds}

Write the splits to disk. 

In [61]:
with open("SIMPD_splits.json","w") as ofs:
    json.dump(out_dict,ofs)

As a quick test, load the splits back from disk.

In [4]:
# Use a context manager so the file is closed after reading
with open("SIMPD_splits.json") as ifs:
    simpd_splits = json.load(ifs)
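Loading the file only shows that the JSON parses. A slightly stronger check is that, for every solution, the training and test indices are disjoint and together cover every row. The sketch below uses a hypothetical helper, `check_split`, and a small made-up dictionary in the same layout as `out_dict`; it round-trips the data through JSON in memory rather than reading the real file.

```python
import json

def check_split(train, test, n_items):
    """Return True if train and test are disjoint and together cover 0..n_items-1."""
    train_set, test_set = set(train), set(test)
    return train_set.isdisjoint(test_set) and (train_set | test_set) == set(range(n_items))

# Hypothetical splits in the same layout as out_dict: one index list per solution
toy_splits = {"train": [[1, 2, 4], [0, 1, 3, 4]], "test": [[0, 3], [2]]}

# Round-trip through JSON, mimicking what is written to and read from disk
loaded = json.loads(json.dumps(toy_splits))

# Check every solution against the number of molecules (5 in this toy case)
results = [check_split(tr, te, 5) for tr, te in zip(loaded["train"], loaded["test"])]
print(results)
```

With the real data, `n_items` would be `len(df)`, and the rows for one solution could then be selected with, for example, `df.iloc[simpd_splits["train"][0]]`.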