In [2]:
# suppress numpy RuntimeWarning
import warnings

warnings.filterwarnings("ignore", category=RuntimeWarning)

# Data preparation

In this tutorial we will go deeper into the different data preparation options that are possible within QSPRpred.

The first step is to load the data and wrap it into a QSPRpred dataset object.
Here we will create a regression dataset, but if you want to learn more about how to prepare
classification data, please check the [classification tutorial](../modelling/classification.ipynb).
If you want to know more about how to specify the target property,
you can have a look at the [documentation](https://cddleiden.github.io/QSPRpred/docs/api/qsprpred.data.html#qsprpred.data.data.TargetProperty) on this topic.

In [3]:
import os

from qsprpred.data import QSPRDataset

os.makedirs("../../tutorial_output/data", exist_ok=True)

dataset = QSPRDataset.fromTableFile(
    filename="../../tutorial_data/A2A_LIGANDS.tsv",
    store_dir="../../tutorial_output/data",
    name="PreparationTutorialDataset",
    target_props=[{"name": "pchembl_value_Mean", "task": "REGRESSION"}],
    random_state=42
)

dataset.getDF()

Failed to find the pandas get_adjustment() function to patch
Failed to patch pandas - PandasTools will have limited functionality


| QSPRID | SMILES | pchembl_value_Mean | Year | QSPRID | pchembl_value_Mean_original |
|---|---|---|---|---|---|
| PreparationTutorialDataset_0000 | `Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...` | 8.68 | 2008.0 | PreparationTutorialDataset_0000 | 8.68 |
| PreparationTutorialDataset_0001 | `Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...` | 4.82 | 2010.0 | PreparationTutorialDataset_0001 | 4.82 |
| PreparationTutorialDataset_0002 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | 5.65 | 2009.0 | PreparationTutorialDataset_0002 | 5.65 |
| PreparationTutorialDataset_0003 | `CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...` | 5.45 | 2009.0 | PreparationTutorialDataset_0003 | 5.45 |
| PreparationTutorialDataset_0004 | `CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...` | 5.20 | 2019.0 | PreparationTutorialDataset_0004 | 5.20 |
| ... | ... | ... | ... | ... | ... |
| PreparationTutorialDataset_4077 | `CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12` | 7.09 | 2018.0 | PreparationTutorialDataset_4077 | 7.09 |
| PreparationTutorialDataset_4078 | `Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1` | 8.22 | 2008.0 | PreparationTutorialDataset_4078 | 8.22 |
| PreparationTutorialDataset_4079 | `Nc1nc(Nc2ccc(F)cc2)nc(CSc2nnc(N)s2)n1` | 4.89 | 2010.0 | PreparationTutorialDataset_4079 | 4.89 |
| PreparationTutorialDataset_4080 | `CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1` | 6.51 | 2013.0 | PreparationTutorialDataset_4080 | 6.51 |


You might have seen the preprocessing steps below in the quick_start or one of the other tutorials.
However, there are many more preprocessing options available in QSPRpred.

The `QSPRDataset.prepareDataset` function allows you to specify a number of preprocessing steps,
which are then applied in a fixed order. If you want more control, you can also apply the
individual preprocessing steps yourself.

In this tutorial we will go through the different preprocessing steps that are available in QSPRpred.
Some have their own dedicated tutorial, which will be linked to below.

The preprocessing steps that `QSPRDataset.prepareDataset` applies are:
1. SMILES standardization
2. [feature calculation](descriptors.ipynb)
3. data filtering
4. fill missing feature values
5. [split into training and test set](data_splitting.ipynb)
6. feature selection
7. feature standardization
8. [outlier removal/applicability domain](applicability_domain.ipynb)
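
To illustrate what "applied in a fixed order" means, here is a minimal sketch in plain Python. It is not QSPRpred code: `prepare` is a hypothetical stand-in, the keyword names only mirror the list above (`applicability_domain` in particular is an assumed name), and nothing is actually computed. The point is only that the order in which you pass the arguments does not change the order in which the steps run.

```python
# Hypothetical illustration of a fixed-order pipeline; not the QSPRpred implementation.
STEP_ORDER = [
    "smiles_standardizer",
    "feature_calculators",
    "data_filters",
    "feature_fill_value",
    "split",
    "feature_filters",
    "feature_standardizer",
    "applicability_domain",  # assumed keyword name
]


def prepare(**steps):
    """Return the names of the requested steps in the order they would run."""
    unknown = set(steps) - set(STEP_ORDER)
    if unknown:
        raise ValueError(f"Unknown steps: {sorted(unknown)}")
    # The order of the keyword arguments is ignored; STEP_ORDER decides.
    return [name for name in STEP_ORDER if steps.get(name) is not None]


# Steps are passed "out of order", but they come back in pipeline order.
order = prepare(
    feature_standardizer="scaler",
    split="random",
    smiles_standardizer="chembl",
)
print(order)
```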


## SMILES standardization

The first step in the data preparation is to standardize the SMILES strings (`smiles_standardizer`).
By default the [ChEMBL structure pipeline](https://github.com/chembl/ChEMBL_Structure_Pipeline) is used for this, but you can also specify your own function.

To use the default ChEMBL structure pipeline, you can pass `"chembl"` (default).
If you want to skip this step, you can pass `None`.

To use your own function, you can pass a function that takes a SMILES string as input and returns a SMILES string as output.
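
As a sketch of that string-in, string-out contract, the function below keeps only the longest `.`-separated fragment of a SMILES string, a crude stand-in for salt stripping. The name `keep_largest_fragment` is hypothetical, and string length is only a rough proxy for fragment size; a real workflow would more likely rely on a cheminformatics toolkit.

```python
def keep_largest_fragment(smiles: str) -> str:
    """Crude string-level standardizer: keep the longest '.'-separated fragment."""
    # In SMILES, '.' separates disconnected components such as counter-ions.
    fragments = smiles.strip().split(".")
    # String length is only a rough proxy for fragment size (assumption of this sketch).
    return max(fragments, key=len)


print(keep_largest_fragment("CC(=O)[O-].[Na+]"))
```

Because it takes a SMILES string and returns one, a function like this could be passed to the `smiles_standardizer` argument.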


In [4]:
from qsprpred.data.descriptors.fingerprints import MorganFP
from qsprpred.data.sampling.splits import RandomSplit

# custom standardizer that canonicalizes the SMILES
def custom_standardizer(smiles):
    from rdkit import Chem
    mol = Chem.MolFromSmiles(smiles)
    smiles = Chem.MolToSmiles(mol, canonical=True)
    return smiles


# calculate compound features and split dataset into train and test
dataset.prepareDataset(
    smiles_standardizer=custom_standardizer,
    split=RandomSplit(test_fraction=0.2, dataset=dataset),
    feature_calculators=[MorganFP(radius=3, nBits=2048)],
    recalculate_features=True,
)

dataset.getDF().head()

| QSPRID | SMILES | pchembl_value_Mean | Year | QSPRID | pchembl_value_Mean_original |
|---|---|---|---|---|---|
| PreparationTutorialDataset_0000 | `Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n...` | 8.68 | 2008.0 | PreparationTutorialDataset_0000 | 8.68 |
| PreparationTutorialDataset_0001 | `Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2nc3c(cc12...` | 4.82 | 2010.0 | PreparationTutorialDataset_0001 | 4.82 |
| PreparationTutorialDataset_0002 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | 5.65 | 2009.0 | PreparationTutorialDataset_0002 | 5.65 |
| PreparationTutorialDataset_0003 | `CNC(=O)C12CC1C(n1cnc3c(NCc4cccc(Cl)c4)nc(C#CCC...` | 5.45 | 2009.0 | PreparationTutorialDataset_0003 | 5.45 |
| PreparationTutorialDataset_0004 | `CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...` | 5.2 | 2019.0 | PreparationTutorialDataset_0004 | 5.2 |


## Data filtering

A number of filters can be applied to the data to remove unwanted compounds.
By default, a filter to remove duplicates is applied (`RepeatsFilter`). It removes compounds that have duplicate descriptors.

A number of other filters are available in QSPRpred, which can be used by passing a list of filter objects to the `data_filters` argument.
You can find all available filters in the [documentation](https://cddleiden.github.io/QSPRpred/docs/api/qsprpred.data.utils.html#module-qsprpred.data.utils.data_filters).

Here, we will use the `RepeatsFilter` to remove duplicates and the `CategoryFilter` to remove compounds based on the value of a categorical property.
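
To make the effect of these two filters concrete, here is a toy sketch in plain Python. It does not use the QSPRpred classes: `drop_all_repeats` and `filter_category` are hypothetical helpers, and the assumption that duplicates are dropped entirely (all copies removed) is a choice of this sketch rather than a statement about what `RepeatsFilter(keep=False)` does internally.

```python
from collections import Counter

# Toy rows; "features" stands in for a calculated descriptor vector.
rows = [
    {"id": "cpd_0", "features": (1, 0, 1), "FakeProperty": "Wow"},
    {"id": "cpd_1", "features": (1, 0, 1), "FakeProperty": "Wow"},  # same features as cpd_0
    {"id": "cpd_2", "features": (0, 1, 1), "FakeProperty": "Nope"},
    {"id": "cpd_3", "features": (1, 1, 0), "FakeProperty": "Wow"},
]


def drop_all_repeats(rows):
    """Remove every row whose feature vector occurs more than once."""
    counts = Counter(row["features"] for row in rows)
    return [row for row in rows if counts[row["features"]] == 1]


def filter_category(rows, name, values, keep=True):
    """Keep (keep=True) or drop (keep=False) rows whose `name` value is in `values`."""
    return [row for row in rows if (row[name] in values) == keep]


filtered = filter_category(drop_all_repeats(rows), "FakeProperty", ["Wow"], keep=True)
print([row["id"] for row in filtered])
```

In this toy example only `cpd_3` survives: `cpd_0` and `cpd_1` share a feature vector, and `cpd_2` has the wrong category.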

In [5]:
import numpy as np

dataset.addProperty(name="FakeProperty",
                    data=np.random.choice(["Wow", "Nope"], len(dataset)))
dataset.getDF().head()

| QSPRID | SMILES | pchembl_value_Mean | Year | QSPRID | pchembl_value_Mean_original | FakeProperty |
|---|---|---|---|---|---|---|
| PreparationTutorialDataset_0000 | `Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n...` | 8.68 | 2008.0 | PreparationTutorialDataset_0000 | 8.68 | Wow |
| PreparationTutorialDataset_0001 | `Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2nc3c(cc12...` | 4.82 | 2010.0 | PreparationTutorialDataset_0001 | 4.82 | Nope |
| PreparationTutorialDataset_0002 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | 5.65 | 2009.0 | PreparationTutorialDataset_0002 | 5.65 | Nope |
| PreparationTutorialDataset_0003 | `CNC(=O)C12CC1C(n1cnc3c(NCc4cccc(Cl)c4)nc(C#CCC...` | 5.45 | 2009.0 | PreparationTutorialDataset_0003 | 5.45 | Wow |
| PreparationTutorialDataset_0004 | `CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...` | 5.2 | 2019.0 | PreparationTutorialDataset_0004 | 5.2 | Nope |


In [6]:
from qsprpred.data.processing.data_filters import RepeatsFilter, CategoryFilter

# filter compounds, calculate features and split dataset into train and test
dataset.prepareDataset(
    data_filters=[
        RepeatsFilter(keep=False),
        # only keep compounds with FakeProperty="Wow"
        CategoryFilter(name="FakeProperty", values=["Wow"], keep=True),
    ],
    split=RandomSplit(test_fraction=0.2, dataset=dataset),
    feature_calculators=[MorganFP(radius=3, nBits=2048)],
    recalculate_features=True,
)

dataset.getDF().head()

| QSPRID | SMILES | pchembl_value_Mean | Year | QSPRID | pchembl_value_Mean_original | FakeProperty |
|---|---|---|---|---|---|---|
| PreparationTutorialDataset_0000 | `Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n...` | 8.68 | 2008.0 | PreparationTutorialDataset_0000 | 8.68 | Wow |
| PreparationTutorialDataset_0003 | `CNC(=O)C12CC1C(n1cnc3c(NCc4cccc(Cl)c4)nc(C#CCC...` | 5.45 | 2009.0 | PreparationTutorialDataset_0003 | 5.45 | Wow |
| PreparationTutorialDataset_0005 | `Cn1c(-n2nccn2)nc2c(N)nc(CCc3ccccc3)nc21` | 8.33 | 2005.0 | PreparationTutorialDataset_0005 | 8.33 | Wow |
| PreparationTutorialDataset_0006 | `Nc1nc(-c2ccccc2)cn2cc(-c3ccco3)nc12` | 6.28 | 2017.0 | PreparationTutorialDataset_0006 | 6.28 | Wow |
| PreparationTutorialDataset_0008 | `N#Cc1c(-c2ccccc2)cc(-c2ccco2)nc1N` | 7.96 | 2008.0 | PreparationTutorialDataset_0008 | 7.96 | Wow |


## Filling missing features

After feature calculation, some features might have missing values (not relevant for Morgan fingerprints).
By default, these are left as NaN, but you can also specify a value to fill them with.
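
What filling does can be sketched in a few lines of plain Python. The helper `fill_missing` is hypothetical and operates on a list of lists rather than on a QSPRpred dataset; it only shows the idea of replacing NaN entries with one constant.

```python
import math


def fill_missing(matrix, fill_value):
    """Return a copy of `matrix` with NaN entries replaced by `fill_value`."""
    return [
        [fill_value if isinstance(x, float) and math.isnan(x) else x for x in row]
        for row in matrix
    ]


features = [[0.5, math.nan, 1.0], [math.nan, 2.0, 3.0]]
print(fill_missing(features, fill_value=5))
# prints [[0.5, 5, 1.0], [5, 2.0, 3.0]]
```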

In [7]:
# calculate compound features and split dataset into train and test
dataset.prepareDataset(
    split=RandomSplit(test_fraction=0.2, dataset=dataset),
    feature_calculators=[MorganFP(radius=3, nBits=2048)],
    feature_fill_value=5,  # fill missing values with 5
    recalculate_features=True,
)

## Feature selection

After feature calculation, you might want to select a subset of features to use for modelling.
There are a number of feature selection methods available in QSPRpred, which can be used by passing a list of feature selection objects to the `feature_filters` argument.

You can find an overview of available feature selection methods in the [documentation](https://cddleiden.github.io/QSPRpred/docs/api/qsprpred.data.utils.html#module-qsprpred.data.utils.featurefilters).


Here, we will use the `HighCorrelationFilter` to remove features that are highly correlated with each other.
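
The idea behind such a filter can be sketched in plain Python. This is a greedy, order-dependent illustration with hypothetical helpers (`pearson`, `drop_correlated`), not the QSPRpred implementation; which member of a correlated pair `HighCorrelationFilter` actually removes may differ. Only the threshold name `th` is taken from the example in this tutorial.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, th=0.95):
    """Greedily keep a column only if |r| <= th with every column kept so far."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= th for other in kept.values()):
            kept[name] = values
    return list(kept)


columns = {
    "bit_0": [1, 0, 1, 0, 1],
    "bit_1": [1, 0, 1, 0, 1],  # identical to bit_0, so r = 1
    "bit_2": [0, 0, 1, 1, 0],
}
print(drop_correlated(columns))
```

Here `bit_1` is dropped because it duplicates `bit_0`, while `bit_2` is only weakly correlated with `bit_0` and is kept.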

In [8]:
from qsprpred.data.processing.feature_filters import HighCorrelationFilter

# calculate compound features and split dataset into train and test
dataset.prepareDataset(
    split=RandomSplit(test_fraction=0.2, dataset=dataset),
    feature_calculators=[MorganFP(radius=3, nBits=2048)],
    recalculate_features=True,
    feature_filters=[HighCorrelationFilter(th=0.95)]
    # remove features with correlation > 0.95
)

print(f"Number of fingerprint bits after filtering: {len(dataset.getDescriptors())}")

Number of fingerprint bits after filtering: 2074


## Feature standardization

You can also specify scikit-learn feature standardization methods to apply to the features.
In this example we will use the scikit-learn `StandardScaler` to standardize the features.
Note that this is not useful for Morgan fingerprints, but we will use it here for demonstration purposes.
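
For intuition, `StandardScaler` rescales each feature to zero mean and unit variance, $z = (x - \mu) / \sigma$, using the population standard deviation. The plain-Python sketch below reproduces that arithmetic for a single feature column. The helpers `fit_scaler` and `transform` are hypothetical, and the sketch assumes the usual practice of learning the statistics from the training set only and reusing them for the test set.

```python
import statistics


def fit_scaler(train_column):
    """Learn mean and population standard deviation from the training values only."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)
    if std == 0:
        # A constant column would divide by zero; fall back to a scale of 1.
        std = 1.0
    return mean, std


def transform(column, mean, std):
    """Apply z = (x - mean) / std to every value."""
    return [(x - mean) / std for x in column]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 6.0]

mean, std = fit_scaler(train)
scaled_train = transform(train, mean, std)
scaled_test = transform(test, mean, std)  # reuses the training statistics
print(round(statistics.fmean(scaled_train), 6), round(statistics.pstdev(scaled_train), 6))
```

After scaling, the training column has mean 0 and standard deviation 1; the test values are shifted and scaled by the same training statistics, so they need not have exactly these moments.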

In [9]:
from qsprpred.data.processing.feature_standardizers import SKLearnStandardizer
from sklearn.preprocessing import StandardScaler

# calculate compound features and split dataset into train and test
dataset.prepareDataset(
    split=RandomSplit(test_fraction=0.2, dataset=dataset),
    feature_calculators=[MorganFP(radius=3, nBits=2048)],
    recalculate_features=True,
    feature_standardizer=SKLearnStandardizer(StandardScaler())  # standardize features
)

Now that you know how to use the different data preparation options, you can start preparing your own datasets.
If you want to collect open-source data for your own project, we recommend you check out the [data collection tutorial](data_collection_with_papyrus.ipynb). This tutorial covers how to easily collect data from [Papyrus](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00672-x), a large-scale curated dataset aimed at bioactivity predictions. Of course, you can also use your own data, or data from other sources.

If you have finished preparing your dataset and are ready to start modelling, you can check out the [model assessment tutorial](model_assessment.ipynb) to learn more about how to train models in QSPRpred, or the [classification tutorial](../modelling/classification.ipynb) if you want to learn more about classification modelling in QSPRpred.