# Data preparation

In this tutorial we will go deeper into the different data preparation options that are possible within QSPRpred.

The first step is to load the data and wrap it into a QSPRpred dataset object.
Here we will create a regression dataset, but if you want to learn more about how to prepare
classification data, please check the [classification tutorial](../modelling/classification.ipynb).
If you want to know more about how to specify the target property,
you can have a look at the [documentation](https://cddleiden.github.io/QSPRpred/docs/api/qsprpred.data.html#qsprpred.data.data.TargetProperty) on this topic.

In [9]:
import os
import pandas as pd
from IPython.display import display
from qsprpred.data.data import QSPRDataset

df = pd.read_csv('../../tutorial_data/A2A_LIGANDS.tsv', sep='\t')

display(df.head())

os.makedirs("../../tutorial_output/data", exist_ok=True)

dataset = QSPRDataset(
    df=df,
    store_dir="../../tutorial_output/data",
    name="A2A_LIGANDS",
    target_props=[{"name": "pchembl_value_Mean", "task": "REGRESSION"}],
    random_state=42,
)

display(dataset.getDF())

| | SMILES | pchembl_value_Mean |
|---|---|---|
| 0 | `Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...` | 8.68 |
| 1 | `Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...` | 4.82 |
| 2 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | 5.65 |
| 3 | `CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...` | 5.45 |
| 4 | `CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...` | 5.2 |




| QSPRID | SMILES | pchembl_value_Mean | QSPRID | Split_IsTrain |
|---|---|---|---|---|
| A2A_LIGANDS_0 | `Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n...` | 8.68 | A2A_LIGANDS_0 | True |
| A2A_LIGANDS_1 | `Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2nc3c(cc12...` | 4.82 | A2A_LIGANDS_1 | True |
| A2A_LIGANDS_2 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | 5.65 | A2A_LIGANDS_2 | True |
| A2A_LIGANDS_3 | `CNC(=O)C12CC1C(n1cnc3c(NCc4cccc(Cl)c4)nc(C#CCC...` | 5.45 | A2A_LIGANDS_3 | True |
| A2A_LIGANDS_4 | `CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...` | 5.20 | A2A_LIGANDS_4 | True |
| ... | ... | ... | ... | ... |
| A2A_LIGANDS_4077 | `CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12` | 7.09 | A2A_LIGANDS_4077 | False |
| A2A_LIGANDS_4078 | `Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1` | 8.22 | A2A_LIGANDS_4078 | True |
| A2A_LIGANDS_4079 | `Nc1nc(CSc2nnc(N)s2)nc(Nc2ccc(F)cc2)n1` | 4.89 | A2A_LIGANDS_4079 | False |
| A2A_LIGANDS_4080 | `CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1` | 6.51 | A2A_LIGANDS_4080 | True |


You might have seen the preprocessing steps below in the quick_start or one of the other tutorials.
However, there are many more preprocessing options available in QSPRpred.

The `QSPRDataset.prepareDataset` method lets you specify a number of preprocessing steps,
which are then applied in a fixed order. If you want more control over the preprocessing steps,
have a look at the [advanced data preparation tutorial](preprocessing.ipynb).

In this tutorial we will go through the different preprocessing steps that are available in QSPRpred.
Some have their own dedicated tutorial, which will be linked to below.

The preprocessing steps that `QSPRDataset.prepareDataset` applies are:
1. SMILES standardization
2. [feature calculation](descriptors.ipynb)
3. data filtering
4. fill missing feature values
5. [split into training and test set](data_splitting.ipynb)
6. feature selection
7. feature standardization


## SMILES standardization

The first step in the data preparation is to standardize the SMILES strings (`smiles_standardizer`).
By default the [ChEMBL structure pipeline](https://github.com/chembl/ChEMBL_Structure_Pipeline) is used for this, but you can also specify your own function.

To use the default ChEMBL structure pipeline, pass `"chembl"` (the default).
To skip this step, pass `None`.

To use your own function, you can pass a function that takes a SMILES string as input and returns a SMILES string as output.
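
As a minimal sketch of such a function, the example below uses only plain Python string handling, so it is a rough heuristic rather than a chemically rigorous standardizer. The name `keep_largest_fragment` is hypothetical and not part of QSPRpred. It strips surrounding whitespace and keeps the longest dot-separated fragment, which is a crude way of dropping counter-ions.

```python
def keep_largest_fragment(smiles: str) -> str:
    """Hypothetical standardizer: keep the longest dot-separated SMILES fragment.

    Fragment length in characters is only a rough proxy for molecular size.
    """
    fragments = [frag for frag in smiles.strip().split(".") if frag]
    if not fragments:
        # Nothing to choose from; return the input unchanged
        return smiles
    return max(fragments, key=len)


print(keep_largest_fragment(" CC(=O)[O-].[Na+] "))  # CC(=O)[O-]
```

Because it takes a SMILES string and returns a SMILES string, a function like this could be passed as `smiles_standardizer=keep_largest_fragment`.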


In [10]:
from qsprpred.data.utils.descriptorsets import FingerprintSet
from qsprpred.data.utils.descriptorcalculator import MoleculeDescriptorsCalculator
from qsprpred.data.utils.datasplitters import RandomSplit


# Calculate Morgan fingerprints (radius 3, 2048 bits)
feature_calculator = MoleculeDescriptorsCalculator(
    desc_sets=[FingerprintSet(fingerprint_type="MorganFP", radius=3, nBits=2048)]
)

# Do a random split for creating the train (80%) and test set (20%)
rand_split = RandomSplit(test_fraction=0.2, dataset=dataset)

# custom standardizer that canonicalizes the SMILES
def custom_standardizer(smiles):
    from rdkit import Chem
    mol = Chem.MolFromSmiles(smiles)
    smiles = Chem.MolToSmiles(mol, canonical=True)
    return smiles

# calculate compound features and split dataset into train and test
dataset.prepareDataset(
    smiles_standardizer = custom_standardizer,
    split=rand_split,
    feature_calculators=[feature_calculator],
    recalculate_features=True,
)

print(dataset.getDF().head())

Missing values filled with nan


                                                          SMILES  \
QSPRID                                                             
A2A_LIGANDS_0  Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n...   
A2A_LIGANDS_1  Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2nc3c(cc12...   
A2A_LIGANDS_2   O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1   
A2A_LIGANDS_3  CNC(=O)C12CC1C(n1cnc3c(NCc4cccc(Cl)c4)nc(C#CCC...   
A2A_LIGANDS_4  CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...   

               pchembl_value_Mean         QSPRID  Split_IsTrain  
QSPRID                                                           
A2A_LIGANDS_0                8.68  A2A_LIGANDS_0           True  
A2A_LIGANDS_1                4.82  A2A_LIGANDS_1           True  
A2A_LIGANDS_2                5.65  A2A_LIGANDS_2           True  
A2A_LIGANDS_3                5.45  A2A_LIGANDS_3           True  
A2A_LIGANDS_4                5.20  A2A_LIGANDS_4           True  


## Data filtering

A number of filters can be applied to the data to remove unwanted compounds.
By default, a filter to remove duplicates is applied (`RepeatsFilter`).

A number of other filters are available in QSPRpred, which can be used by passing a list of filter objects to the `datafilters` argument.
You can find all available filters in the [documentation](https://cddleiden.github.io/QSPRpred/docs/api/qsprpred.data.utils.html#module-qsprpred.data.utils.datafilters).

Here, we will use the `RepeatsFilter` and the `CategoryFilter` to remove duplicates and compounds based on the value of a categorical feature.
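
To make the effect of these two kinds of filter concrete before running them, here is a plain-Python sketch that does not use QSPRpred. The helper names `drop_repeats` and `keep_categories` and the toy records are hypothetical. The real filters operate on the dataset itself and may define duplicates differently, so treat this only as an illustration of the idea.

```python
def drop_repeats(records, key="SMILES"):
    """Keep only the first record for each distinct value of `key`."""
    seen = set()
    kept = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            kept.append(record)
    return kept


def keep_categories(records, name, values):
    """Keep only records whose property `name` is one of `values`."""
    allowed = set(values)
    return [record for record in records if record[name] in allowed]


# Toy data, not taken from the A2A dataset
records = [
    {"SMILES": "CCO", "FakeProperty": "Wow"},
    {"SMILES": "CCO", "FakeProperty": "Wow"},  # exact repeat of the first record
    {"SMILES": "CCN", "FakeProperty": "Nope"},
    {"SMILES": "c1ccccc1", "FakeProperty": "Wow"},
]

filtered = keep_categories(drop_repeats(records), "FakeProperty", ["Wow"])
print([record["SMILES"] for record in filtered])  # ['CCO', 'c1ccccc1']
```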

In [11]:
import numpy as np

dataset.addProperty(name="FakeProperty", data=np.random.choice(["Wow", "Nope"], len(dataset)))
dataset.getDF().head()

| QSPRID | SMILES | pchembl_value_Mean | QSPRID | Split_IsTrain | FakeProperty |
|---|---|---|---|---|---|
| A2A_LIGANDS_0 | `Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n...` | 8.68 | A2A_LIGANDS_0 | True | Nope |
| A2A_LIGANDS_1 | `Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2nc3c(cc12...` | 4.82 | A2A_LIGANDS_1 | True | Wow |
| A2A_LIGANDS_2 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | 5.65 | A2A_LIGANDS_2 | True | Wow |
| A2A_LIGANDS_3 | `CNC(=O)C12CC1C(n1cnc3c(NCc4cccc(Cl)c4)nc(C#CCC...` | 5.45 | A2A_LIGANDS_3 | True | Nope |
| A2A_LIGANDS_4 | `CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...` | 5.2 | A2A_LIGANDS_4 | True | Wow |


In [13]:
from qsprpred.data.utils.descriptorsets import FingerprintSet
from qsprpred.data.utils.descriptorcalculator import MoleculeDescriptorsCalculator
from qsprpred.data.utils.datasplitters import RandomSplit
from qsprpred.data.utils.datafilters import CategoryFilter, RepeatsFilter

# Calculate Morgan fingerprints (radius 3, 2048 bits)
feature_calculator = MoleculeDescriptorsCalculator(
    desc_sets=[FingerprintSet(fingerprint_type="MorganFP", radius=3, nBits=2048)]
)

# Do a random split for creating the train (80%) and test set (20%)
rand_split = RandomSplit(test_fraction=0.2, dataset=dataset)

# calculate compound features and split dataset into train and test
dataset.prepareDataset(
    datafilters=[RepeatsFilter(keep=True),
                 CategoryFilter(name="FakeProperty", values=["Wow"], keep=True)], # only keep compounds with FakeProperty="Wow"
    split=rand_split,
    feature_calculators=[feature_calculator],
    recalculate_features=True,
)

dataset.getDF().head()

Missing values filled with nan


| QSPRID | SMILES | pchembl_value_Mean | QSPRID | Split_IsTrain | FakeProperty |
|---|---|---|---|---|---|
| A2A_LIGANDS_1 | `Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2nc3c(cc12...` | 4.82 | A2A_LIGANDS_1 | True | Wow |
| A2A_LIGANDS_2 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | 5.65 | A2A_LIGANDS_2 | True | Wow |
| A2A_LIGANDS_4 | `CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...` | 5.2 | A2A_LIGANDS_4 | True | Wow |
| A2A_LIGANDS_5 | `Cn1c(-n2nccn2)nc2c(N)nc(CCc3ccccc3)nc21` | 8.33 | A2A_LIGANDS_5 | True | Wow |
| A2A_LIGANDS_7 | `CCCn1c(=O)c2c(nc3n2CCCN3c2ccc(OCCN3CCCC3)cc2)n...` | 5.62 | A2A_LIGANDS_7 | True | Wow |


## Filling missing features

After feature calculation, some features might have missing values (not relevant for Morgan fingerprints).
By default, these are left as NaN, but you can also specify a value to fill them with.
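
Conceptually, filling amounts to replacing every NaN in the feature matrix with a constant. The sketch below shows this in plain Python on a toy matrix. `fill_missing` is a hypothetical helper, not a QSPRpred function, and the matrix is made up for illustration.

```python
import math


def fill_missing(rows, fill_value):
    """Return a copy of `rows` with every NaN replaced by `fill_value`."""
    return [
        [fill_value if isinstance(x, float) and math.isnan(x) else x for x in row]
        for row in rows
    ]


# Toy feature matrix with two missing entries
features = [
    [1.0, math.nan, 0.5],
    [math.nan, 2.0, 0.0],
]

print(fill_missing(features, fill_value=5))  # [[1.0, 5, 0.5], [5, 2.0, 0.0]]
```

Which fill value is sensible depends on the descriptor; a constant such as 5 is only meaningful if it cannot be confused with a real value of that feature.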

In [14]:
from qsprpred.data.utils.descriptorsets import FingerprintSet
from qsprpred.data.utils.descriptorcalculator import MoleculeDescriptorsCalculator
from qsprpred.data.utils.datasplitters import RandomSplit

# Calculate Morgan fingerprints (radius 3, 2048 bits)
feature_calculator = MoleculeDescriptorsCalculator(
    desc_sets=[FingerprintSet(fingerprint_type="MorganFP", radius=3, nBits=2048)]
)

# Do a random split for creating the train (80%) and test set (20%)
rand_split = RandomSplit(test_fraction=0.2, dataset=dataset)

# calculate compound features and split dataset into train and test
dataset.prepareDataset(
    split=rand_split,
    feature_calculators=[feature_calculator],
    feature_fill_value=5, # fill missing values with 5
    recalculate_features=True,
)

Missing values filled with 5


| QSPRID | SMILES | pchembl_value_Mean | QSPRID | Split_IsTrain | FakeProperty |
|---|---|---|---|---|---|
| A2A_LIGANDS_1 | `Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2nc3c(cc12...` | 4.82 | A2A_LIGANDS_1 | True | Wow |
| A2A_LIGANDS_2 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | 5.65 | A2A_LIGANDS_2 | True | Wow |
| A2A_LIGANDS_4 | `CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...` | 5.2 | A2A_LIGANDS_4 | True | Wow |
| A2A_LIGANDS_5 | `Cn1c(-n2nccn2)nc2c(N)nc(CCc3ccccc3)nc21` | 8.33 | A2A_LIGANDS_5 | True | Wow |
| A2A_LIGANDS_7 | `CCCn1c(=O)c2c(nc3n2CCCN3c2ccc(OCCN3CCCC3)cc2)n...` | 5.62 | A2A_LIGANDS_7 | True | Wow |


## Feature selection

After feature calculation, you might want to select a subset of features to use for modelling.
There are a number of feature selection methods available in QSPRpred, which can be used by passing a list of feature selection objects to the `feature_filters` argument.

You can find an overview of available feature selection methods in the [documentation](https://cddleiden.github.io/QSPRpred/docs/api/qsprpred.data.utils.html#module-qsprpred.data.utils.featurefilters).


Here, we will use the `HighCorrelationFilter` to remove features that are highly correlated with each other.
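
The idea behind a correlation-based filter can be sketched in plain Python. The version below is a simple greedy scheme under stated assumptions: features are visited in order, and a feature is kept only if its absolute Pearson correlation with every already-kept feature is at most the threshold. The helpers `pearson` and `drop_correlated` are hypothetical, and `HighCorrelationFilter` may choose which feature of a correlated pair to drop differently.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant feature; treat it as uncorrelated
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold):
    """Greedily keep a feature only if |r| <= threshold with every feature kept so far."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy features: bit_b is identical to bit_a, bit_c is uncorrelated with bit_a
columns = {
    "bit_a": [1, 0, 1, 0, 1, 1],
    "bit_b": [1, 0, 1, 0, 1, 1],
    "bit_c": [0, 0, 1, 1, 0, 1],
}

print(drop_correlated(columns, threshold=0.95))  # ['bit_a', 'bit_c']
```

Constant features have zero variance, so their correlation is undefined; this sketch sidesteps the division by zero by treating them as uncorrelated.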

In [20]:
from qsprpred.data.utils.descriptorsets import FingerprintSet
from qsprpred.data.utils.descriptorcalculator import MoleculeDescriptorsCalculator
from qsprpred.data.utils.datasplitters import RandomSplit
from qsprpred.data.utils.featurefilters import HighCorrelationFilter

# Calculate Morgan fingerprints (radius 3, 2048 bits)
feature_calculator = MoleculeDescriptorsCalculator(
    desc_sets=[FingerprintSet(fingerprint_type="MorganFP", radius=3, nBits=2048)]
)

# Do a random split for creating the train (80%) and test set (20%)
rand_split = RandomSplit(test_fraction=0.2, dataset=dataset)

# calculate compound features and split dataset into train and test
dataset.prepareDataset(
    split=rand_split,
    feature_calculators=[feature_calculator],
    recalculate_features=True,
    feature_filters=[HighCorrelationFilter(th=0.95)] # remove features with correlation > 0.95
)

print(f"Number of fingerprint bits after filtering: {len(dataset.getDescriptors())}")

Missing values filled with nan


Number of fingerprint bits after filtering: 2006


## Feature standardization

You can also specify scikit-learn feature standardization methods to apply to the features.
In this example we will use the scikit-learn `StandardScaler` to standardize the features.
Note that this is not useful for Morgan fingerprints, but we will use it here for demonstration purposes.
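
As a rough illustration of what z-score standardization does, the plain-Python sketch below computes the per-feature mean and population standard deviation on a toy training set and reuses them for the test set. The helpers `fit_scaler` and `transform` are hypothetical names, and the data are made up; the example only mimics the idea behind `StandardScaler`, not its implementation.

```python
from statistics import fmean, pstdev


def fit_scaler(train_rows):
    """Compute per-feature mean and population standard deviation on the training set."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    # Use a scale of 1.0 for constant features to avoid division by zero
    scales = [pstdev(col) or 1.0 for col in columns]
    return means, scales


def transform(rows, means, scales):
    """Apply z-score scaling with previously fitted statistics."""
    return [
        [(x - m) / s for x, m, s in zip(row, means, scales)]
        for row in rows
    ]


# Toy data: two features, the second one is constant in the training set
train = [[1.0, 3.0], [2.0, 3.0], [3.0, 3.0]]
test = [[2.0, 4.0]]

means, scales = fit_scaler(train)
print(transform(test, means, scales))  # [[0.0, 1.0]]
```

Fitting on the training set only and applying the same statistics to the test set keeps information about the test compounds out of the preprocessing.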

In [21]:
from qsprpred.data.utils.descriptorsets import FingerprintSet
from qsprpred.data.utils.descriptorcalculator import MoleculeDescriptorsCalculator
from qsprpred.data.utils.datasplitters import RandomSplit
from qsprpred.data.utils.feature_standardization import SKLearnStandardizer
from sklearn.preprocessing import StandardScaler

# Calculate Morgan fingerprints (radius 3, 2048 bits)
feature_calculator = MoleculeDescriptorsCalculator(
    desc_sets=[FingerprintSet(fingerprint_type="MorganFP", radius=3, nBits=2048)]
)

# Do a random split for creating the train (80%) and test set (20%)
rand_split = RandomSplit(test_fraction=0.2, dataset=dataset)

# calculate compound features and split dataset into train and test
dataset.prepareDataset(
    split=rand_split,
    feature_calculators=[feature_calculator],
    recalculate_features=True,
    feature_standardizer=SKLearnStandardizer(StandardScaler())  # wrap a scaler instance to standardize features
)

Missing values filled with nan
