# Implementing Custom Cross-Validation and Train-Test Splits

Data set splits in QSPRpred adhere to the splitting API used by scikit-learn, so scikit-learn splitter classes can be used directly with QSPRpred data sets:

In [1]:
import pandas as pd

from qsprpred.data import QSPRDataset
from qsprpred.data.descriptors.fingerprints import MorganFP

dataset = QSPRDataset(
    df=pd.read_csv('../../tutorial_data/A2A_LIGANDS.tsv', sep='\t'),
    store_dir="../../tutorial_output/data",
    name="CustomSplitDataSet",
    target_props=[{"name": "pchembl_value_Mean", "task": "REGRESSION"}],
    random_state=42
)
dataset.addDescriptors([MorganFP(nBits=128)])
dataset.getDescriptors().shape

(4082, 128)

In [2]:
from sklearn.model_selection import ShuffleSplit

# train-test split
dataset.split(ShuffleSplit(n_splits=1))
train, test = dataset.getFeatures()
print(train.shape, test.shape)

(3673, 128) (409, 128)


In [3]:
# cross-validation / resampling
for train, test, *_ in dataset.iterFolds(ShuffleSplit(n_splits=3)):
    print(train.shape, test.shape)

(3305, 128) (368, 128)
(3305, 128) (368, 128)
(3305, 128) (368, 128)
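
A scikit-learn splitter's `split` method yields pairs of positional index arrays, one pair per split. The sketch below shows this on a toy array, with no QSPRpred involved; the array `X` and the variable names are made up for illustration. With the default `test_size=0.1` written out explicitly, 20 samples are divided into 18 training and 2 test rows, the same 90/10 ratio seen in the shapes above.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy feature matrix: 20 samples, 2 features (illustrative only)
X = np.arange(40).reshape(20, 2)

# test_size=0.1 is the ShuffleSplit default, written out for clarity
splitter = ShuffleSplit(n_splits=3, test_size=0.1, random_state=42)

# split() yields one (train_indices, test_indices) pair per split
folds = list(splitter.split(X))
for train_idx, test_idx in folds:
    print(X[train_idx].shape, X[test_idx].shape)
```

Note that the fold shapes printed by `iterFolds` above add up to 3673 rows, the size of the training set from the preceding train-test split, which suggests that `iterFolds` resamples only the training portion of the data set.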


However, QSPRpred also defines its own `DataSplit` interface, which provides extended features. A custom splitter is written by subclassing `DataSplit` and implementing its `split` method:

In [4]:
from qsprpred.data.sampling.splits import DataSplit

import numpy as np
from typing import Iterable

class MySplit(DataSplit):
    """My customized split"""

    def __init__(self, test_ids: list[list[str]]):
        super().__init__()
        self.test_ids = test_ids

    def split(
        self,
        X: np.ndarray | pd.DataFrame, 
        y: np.ndarray | pd.DataFrame | pd.Series
    ) -> Iterable[tuple[list[int], list[int]]]:
        """Use only the specified IDs from the data set as the test set.

        Returns an iterable of (train, test) index pairs,
        just like a scikit-learn splitter would.
        """
        splits = []
        for test_ids in self.test_ids:
            # positional indices of rows whose ID is in this fold's test set
            test = np.where(X.index.isin(test_ids))[0]
            # all remaining rows form the training set
            train = np.where(~X.index.isin(test_ids))[0]
            splits.append((train, test))
        return splits
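
The core of `split` is the index lookup: `X.index.isin(...)` gives a boolean mask over the rows, and `np.where(...)[0]` turns that mask into positional indices. Here is a minimal standalone sketch of that step, using a made-up data frame with hypothetical `Toy_000N` IDs instead of a QSPRpred data set:

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame indexed by string IDs (stand-in for QSPRID)
X = pd.DataFrame(
    {"feature": [0.1, 0.2, 0.3, 0.4, 0.5]},
    index=[f"Toy_{i:04d}" for i in range(5)],
)
test_ids = ["Toy_0001", "Toy_0003"]

# Boolean mask: True for rows whose ID is in the requested test set
is_test = X.index.isin(test_ids)

# Positional (integer) indices, as a scikit-learn style splitter returns them
test = np.where(is_test)[0]
train = np.where(~is_test)[0]

print(train.tolist(), test.tolist())
```

Because the lookup relies on `X.index`, this approach assumes that `X` is a pandas object indexed by the data set IDs; a plain NumPy array has no such index.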

In [5]:
# train-test split
my_split = MySplit([["CustomSplitDataSet_0007", "CustomSplitDataSet_0077"]])
dataset.split(my_split)
train, test = dataset.getFeatures()
test.index # desired IDs in the test set

Index(['CustomSplitDataSet_0007', 'CustomSplitDataSet_0077'], dtype='object', name='QSPRID')

In [6]:
# cross-validation / resampling
my_split = MySplit([
    ["CustomSplitDataSet_0006", "CustomSplitDataSet_0066"],
    ["CustomSplitDataSet_0008", "CustomSplitDataSet_0088"],
])
for train, test, *_ in dataset.iterFolds(my_split):
    print(test.index) # desired IDs in the test set

Index(['CustomSplitDataSet_0006', 'CustomSplitDataSet_0066'], dtype='object', name='QSPRID')
Index(['CustomSplitDataSet_0008', 'CustomSplitDataSet_0088'], dtype='object', name='QSPRID')


So far, this is not much different from a plain `scikit-learn` splitter, but subclassing `DataSplit` also gives us access to the data set being split. Here is an implementation of a splitter that places molecules with certain substructures in the test set of each fold and uses the `DataSplit` API (`getDataSet`) to access the data set being split:

In [7]:
class MyScaffoldSplit(DataSplit):
    """My customized scaffold split"""

    def __init__(self, test_smarts: list[list[str]]):
        super().__init__()
        self.test_smarts = test_smarts

    def split(
        self,
        X: np.ndarray | pd.DataFrame, 
        y: np.ndarray | pd.DataFrame | pd.Series
    ) -> Iterable[tuple[list[int], list[int]]]:
        """Match molecules containing certain SMARTS substructures
        and put them in the test set.
        """
        # DataSplit gives access to the data set currently being split
        dataset = self.getDataSet()
        splits = []
        for test_smart in self.test_smarts:
            # select molecules that match all patterns in this list
            test_subset = dataset.searchWithSMARTS(test_smart, operator="and")
            test_ids = test_subset.getProperty("QSPRID").values
            # convert the matching IDs to positional indices
            test = np.where(X.index.isin(test_ids))[0]
            train = np.where(~X.index.isin(test_ids))[0]
            splits.append((train, test))
        return splits
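
The `operator="and"` argument is used here so that a molecule must contain every pattern in the list to be selected, which fits the "aromatic sulfonamides" and "aromatic carboxylic acids" labels used further below. The sketch illustrates that logic with plain RDKit. It is not QSPRpred's implementation of `searchWithSMARTS`, and the helper `matches_all` and the example molecules are made up for illustration.

```python
from rdkit import Chem


def matches_all(smiles: str, smarts_list: list[str]) -> bool:
    """Hypothetical helper: True if the molecule contains every SMARTS pattern."""
    mol = Chem.MolFromSmiles(smiles)
    patterns = [Chem.MolFromSmarts(s) for s in smarts_list]
    return all(mol.HasSubstructMatch(p) for p in patterns)


# Made-up example molecules (not taken from the tutorial data set)
molecules = {
    "benzenesulfonamide": "NS(=O)(=O)c1ccccc1",
    "benzoic acid": "OC(=O)c1ccccc1",
    "methanesulfonamide": "CS(N)(=O)=O",
    "acetic acid": "CC(=O)O",
}

# Same pattern lists as passed to MyScaffoldSplit in the tutorial
aromatic_sulfonamides = [
    name for name, smi in molecules.items()
    if matches_all(smi, ["[ar]", "NS(=O)(=O)"])
]
aromatic_acids = [
    name for name, smi in molecules.items()
    if matches_all(smi, ["[ar]", "C(=O)[OH]"])
]

print(aromatic_sulfonamides, aromatic_acids)
```

The non-aromatic molecules contain the functional group but no aromatic ring atom, so they fail the combined condition and would stay in the training set.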

In [8]:
my_split = MyScaffoldSplit([["[ar]", "NS(=O)(=O)"], ["[ar]", "C(=O)[OH]"]])
test_indices = []
for train, test, *_ in dataset.iterFolds(my_split):
    test_indices.append(test.index)

In [9]:
from qsprpred.plotting.grid_visualizers import smiles_to_grid, interactive_grid

# aromatic sulfonamides
smiles_to_grid(dataset.searchWithIndex(test_indices[0]).smiles, impl=interactive_grid)

MolGridWidget()

In [10]:
# aromatic carboxylic acids
smiles_to_grid(dataset.searchWithIndex(test_indices[1]).smiles, impl=interactive_grid)

MolGridWidget()