In [1]:
# suppress numpy RuntimeWarning
import warnings

warnings.filterwarnings("ignore", category=RuntimeWarning)

# Data preparation

In this tutorial we will go deeper into the different data preparation options that are possible within QSPRpred.

The first step is to load the data and wrap it into a QSPRpred dataset object.
Here we will create a regression dataset, but if you want to learn more about how to prepare
classification data, please check the [classification tutorial](../modelling/classification.ipynb).
If you want to know more about how to specify the target property,
you can have a look at the [documentation](https://cddleiden.github.io/QSPRpred/docs/api/qsprpred.data.html#qsprpred.data.data.TargetProperty) on this topic.

In [2]:
import os

from qsprpred.data import QSPRTable

os.makedirs("../../tutorial_output/data", exist_ok=True)

dataset = QSPRTable.fromTableFile(
    filename="../../tutorial_data/A2A_LIGANDS.tsv",
    path="../../tutorial_output/data",
    name="PreparationTutorialDataset",
    target_props=[{"name": "pchembl_value_Mean", "task": "REGRESSION"}],
)
dataset.randomState = 42

dataset.getDF()

ID,SMILES,pchembl_value_Mean,Year,original_smiles,ID,ID_before_change,pchembl_value_Mean_original
PreparationTutorialDataset_storage_library_0000,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,8.68,2008.0,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,PreparationTutorialDataset_storage_library_0000,PreparationTutorialDataset_storage_library_0000,8.68
PreparationTutorialDataset_storage_library_0001,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,4.82,2010.0,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,PreparationTutorialDataset_storage_library_0001,PreparationTutorialDataset_storage_library_0001,4.82
PreparationTutorialDataset_storage_library_0002,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,PreparationTutorialDataset_storage_library_0002,PreparationTutorialDataset_storage_library_0002,5.65
PreparationTutorialDataset_storage_library_0003,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,5.45,2009.0,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,PreparationTutorialDataset_storage_library_0003,PreparationTutorialDataset_storage_library_0003,5.45
PreparationTutorialDataset_storage_library_0004,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,5.20,2019.0,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,PreparationTutorialDataset_storage_library_0004,PreparationTutorialDataset_storage_library_0004,5.20
...,...,...,...,...,...,...,...
PreparationTutorialDataset_storage_library_4077,CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12,7.09,2018.0,CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12,PreparationTutorialDataset_storage_library_4077,PreparationTutorialDataset_storage_library_4077,7.09
PreparationTutorialDataset_storage_library_4078,Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1,8.22,2008.0,Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1,PreparationTutorialDataset_storage_library_4078,PreparationTutorialDataset_storage_library_4078,8.22
PreparationTutorialDataset_storage_library_4079,Nc1nc(Nc2ccc(F)cc2)nc(CSc2nnc(N)s2)n1,4.89,2010.0,Nc1nc(Nc2ccc(F)cc2)nc(CSc2nnc(N)s2)n1,PreparationTutorialDataset_storage_library_4079,PreparationTutorialDataset_storage_library_4079,4.89
PreparationTutorialDataset_storage_library_4080,CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1,6.51,2013.0,CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1,PreparationTutorialDataset_storage_library_4080,PreparationTutorialDataset_storage_library_4080,6.51


You might have seen the preprocessing steps below in the quick_start or one of the other tutorials.
However, there are many more preprocessing options available in QSPRpred.

The `QSPRData.prepareDataset` function allows you to specify a number of preprocessing steps,
which are then applied in a fixed order, although you can also use the individual preprocessing steps if
you want to have more control.

In this tutorial we will go through the different preprocessing steps that are available in QSPRpred.
Some have their own dedicated tutorial, which will be linked to below.

The preprocessing steps that `QSPRData.prepareDataset` applies are:

1. [feature calculation](descriptors.ipynb)
2. data filtering
3. fill missing feature values
4. [split into training and test set](data_splitting.ipynb)
5. feature selection
6. feature standardization
7. [outlier removal/applicability domain](applicability_domain.ipynb)

**Note:** In this tutorial we assume the SMILES are already standardized, which they are since they come from the Papyrus database. Therefore, we do not apply a standardizer before calculating features. If you would like to apply custom standardization, you can find out how in the [data representation tutorial](data_representation.ipynb).
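
Step 4 of the list above, the random train/test split, can be pictured with a small NumPy sketch. This is not QSPRpred's implementation: the helper `random_split` and the compound IDs are hypothetical, and the sketch only assumes that a random split shuffles the IDs with a fixed seed and holds out the requested fraction.

```python
import numpy as np


def random_split(ids, test_fraction=0.2, seed=42):
    """Hypothetical helper: shuffle ids reproducibly and hold out a test fraction."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(ids)
    n_test = int(round(len(ids) * test_fraction))
    # The first n_test shuffled IDs form the test set, the rest the training set.
    return shuffled[n_test:].tolist(), shuffled[:n_test].tolist()


# Made-up compound IDs for illustration only.
ids = [f"mol_{i:04d}" for i in range(10)]
train_ids, test_ids = random_split(ids, test_fraction=0.2, seed=42)
print(len(train_ids), len(test_ids))  # 8 2
```

Fixing the seed plays the same role as setting `dataset.randomState` earlier: repeated runs give the same partition.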

In [3]:
from qsprpred.data.descriptors.fingerprints import MorganFP
from qsprpred.data.sampling.splits import RandomSplit

# calculate compound features and split dataset into train and test
dataset.prepareDataset(
    split=RandomSplit(test_fraction=0.2, dataset=dataset),
    feature_calculators=[MorganFP(radius=3, nBits=2048)],
    recalculate_features=True,
)

dataset.getDF().head()

The following rows contain duplicates: [array(['PreparationTutorialDataset_storage_library_3584',
       'PreparationTutorialDataset_storage_library_4020'], dtype=object)]


ID,SMILES,pchembl_value_Mean,Year,original_smiles,ID,ID_before_change,pchembl_value_Mean_original
PreparationTutorialDataset_storage_library_0000,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,8.68,2008.0,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,PreparationTutorialDataset_storage_library_0000,PreparationTutorialDataset_storage_library_0000,8.68
PreparationTutorialDataset_storage_library_0001,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,4.82,2010.0,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,PreparationTutorialDataset_storage_library_0001,PreparationTutorialDataset_storage_library_0001,4.82
PreparationTutorialDataset_storage_library_0002,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,PreparationTutorialDataset_storage_library_0002,PreparationTutorialDataset_storage_library_0002,5.65
PreparationTutorialDataset_storage_library_0003,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,5.45,2009.0,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,PreparationTutorialDataset_storage_library_0003,PreparationTutorialDataset_storage_library_0003,5.45
PreparationTutorialDataset_storage_library_0004,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,5.2,2019.0,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,PreparationTutorialDataset_storage_library_0004,PreparationTutorialDataset_storage_library_0004,5.2


You can see that in this case you get a warning about potential duplicates in the dataset. Let's investigate this further:

In [4]:
duplicate_ids = [
    "PreparationTutorialDataset_storage_library_3584",
    "PreparationTutorialDataset_storage_library_4020"
]
dataset.getProperty(dataset.smilesProp, ids=duplicate_ids)

ID
PreparationTutorialDataset_storage_library_3584    CCNC(=O)C1OC(n2cnc3c2ncnc3NCCCCCCCCNc2ccc([N+]...
PreparationTutorialDataset_storage_library_4020    CCNC(=O)C1OC(n2cnc3c2ncnc3NCCCCCCCCCCNc2ccc([N...
Name: SMILES, dtype: object

It seems that we do not have duplicated SMILES. This makes sense, as the SMILES are standardized in the Papyrus database and should be unique by default. However, we can still check for duplicates in the descriptor space as suggested by the message:

In [5]:
subset = dataset.getSubset(dataset.getProperties(),
                           ids=duplicate_ids)
subset.getDescriptors()

ID,MorganFP_MorganFP_0,MorganFP_MorganFP_1,MorganFP_MorganFP_2,MorganFP_MorganFP_3,MorganFP_MorganFP_4,MorganFP_MorganFP_5,MorganFP_MorganFP_6,MorganFP_MorganFP_7,MorganFP_MorganFP_8,MorganFP_MorganFP_9,...,MorganFP_MorganFP_2038,MorganFP_MorganFP_2039,MorganFP_MorganFP_2040,MorganFP_MorganFP_2041,MorganFP_MorganFP_2042,MorganFP_MorganFP_2043,MorganFP_MorganFP_2044,MorganFP_MorganFP_2045,MorganFP_MorganFP_2046,MorganFP_MorganFP_2047
PreparationTutorialDataset_storage_library_3584,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
PreparationTutorialDataset_storage_library_4020,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True


Indeed, it looks like we have duplicates in the descriptor space. We can deal with this by adding more descriptors to make the representations unique or we can apply filtering to remove these duplicates. We will elect to do the latter in the next section.
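
The situation above, where distinct molecules share one fingerprint, can be reproduced on a toy table. The sketch below uses plain pandas rather than any QSPRpred call, and the 4-bit fingerprints and IDs are made up. The two real compounds appear to differ only in the length of an alkyl linker, which is a plausible reason for a radius-3 bit fingerprint not telling them apart, but that is an interpretation of the truncated SMILES shown above.

```python
import pandas as pd

# Hypothetical 4-bit fingerprints: mol_b and mol_c are different molecules
# that happen to set exactly the same bits.
fps = pd.DataFrame(
    [[1, 0, 0, 1], [0, 1, 0, 1], [0, 1, 0, 1]],
    index=["mol_a", "mol_b", "mol_c"],
    columns=[f"bit_{i}" for i in range(4)],
)

# keep=False flags every member of a duplicated group, not only later copies.
is_duplicate = fps.duplicated(keep=False)
duplicate_ids = fps.index[is_duplicate].tolist()
print(duplicate_ids)  # ['mol_b', 'mol_c']
```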

## Data filtering

A number of filters can be applied to the data to remove unwanted compounds.
By default, a filter to remove duplicates is applied (`RepeatsFilter`). It removes compounds that have duplicate descriptors as we already discovered above.

A number of other filters are available in QSPRpred, which can be used by passing a list of filter objects to the `data_filters` argument.
You can find all available filters in the [documentation](https://cddleiden.github.io/QSPRpred/docs/api/qsprpred.data.utils.html#module-qsprpred.data.utils.data_filters).

Here, we will use the `RepeatsFilter` and the `CategoryFilter` to remove duplicates and compounds based on the value of a categorical feature.

In [6]:
import numpy as np

dataset.addProperty(name="FakeProperty",
                    data=np.random.choice(["Wow", "Nope"], len(dataset)))
dataset.getDF().head()

ID,SMILES,pchembl_value_Mean,Year,original_smiles,ID,ID_before_change,pchembl_value_Mean_original,FakeProperty
PreparationTutorialDataset_storage_library_0000,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,8.68,2008.0,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,PreparationTutorialDataset_storage_library_0000,PreparationTutorialDataset_storage_library_0000,8.68,Wow
PreparationTutorialDataset_storage_library_0001,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,4.82,2010.0,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,PreparationTutorialDataset_storage_library_0001,PreparationTutorialDataset_storage_library_0001,4.82,Wow
PreparationTutorialDataset_storage_library_0002,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,PreparationTutorialDataset_storage_library_0002,PreparationTutorialDataset_storage_library_0002,5.65,Nope
PreparationTutorialDataset_storage_library_0003,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,5.45,2009.0,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,PreparationTutorialDataset_storage_library_0003,PreparationTutorialDataset_storage_library_0003,5.45,Wow
PreparationTutorialDataset_storage_library_0004,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,5.2,2019.0,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,PreparationTutorialDataset_storage_library_0004,PreparationTutorialDataset_storage_library_0004,5.2,Wow


Now we apply both filters. Since the compounds with duplicate descriptors are removed, you should see no duplicate warning this time:

In [7]:
from qsprpred.data.processing.data_filters import RepeatsFilter, CategoryFilter

# calculate compound features and split dataset into train and test
dataset.prepareDataset(
    data_filters=[RepeatsFilter(keep=False),
                  CategoryFilter(name="FakeProperty", values=["Wow"], keep=True)],
    # only keep compounds with FakeProperty="Wow"
    split=RandomSplit(test_fraction=0.2, dataset=dataset),
    feature_calculators=[MorganFP(radius=3, nBits=2048)],
    recalculate_features=True,
)

dataset.getDF().head()

ID,SMILES,pchembl_value_Mean,Year,original_smiles,ID,ID_before_change,pchembl_value_Mean_original,FakeProperty
PreparationTutorialDataset_storage_library_0000,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,8.68,2008.0,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,PreparationTutorialDataset_storage_library_0000,PreparationTutorialDataset_storage_library_0000,8.68,Wow
PreparationTutorialDataset_storage_library_0001,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,4.82,2010.0,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,PreparationTutorialDataset_storage_library_0001,PreparationTutorialDataset_storage_library_0001,4.82,Wow
PreparationTutorialDataset_storage_library_0003,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,5.45,2009.0,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,PreparationTutorialDataset_storage_library_0003,PreparationTutorialDataset_storage_library_0003,5.45,Wow
PreparationTutorialDataset_storage_library_0004,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,5.2,2019.0,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,PreparationTutorialDataset_storage_library_0004,PreparationTutorialDataset_storage_library_0004,5.2,Wow
PreparationTutorialDataset_storage_library_0005,Cn1c(-n2nccn2)nc2c1nc(CCc1ccccc1)nc2N,8.33,2005.0,Cn1c(-n2nccn2)nc2c1nc(CCc1ccccc1)nc2N,PreparationTutorialDataset_storage_library_0005,PreparationTutorialDataset_storage_library_0005,8.33,Wow


This also reduces the size of our data set overall:

In [8]:
len(dataset)

2017
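
To see what the two filters do to a table, here is a rough pandas analogue on made-up data. It is only a sketch of the behaviour as we read the arguments used above (`keep=False` dropping all copies of a repeated descriptor row, `keep=True` retaining the listed category values), not the QSPRpred implementation.

```python
import pandas as pd

# Made-up table: two descriptor bits plus a categorical property.
df = pd.DataFrame(
    {
        "bit_0": [1, 0, 0, 1],
        "bit_1": [0, 1, 1, 1],
        "FakeProperty": ["Wow", "Wow", "Nope", "Nope"],
    },
    index=["mol_a", "mol_b", "mol_c", "mol_d"],
)
descriptor_cols = ["bit_0", "bit_1"]

# Rough analogue of RepeatsFilter(keep=False): drop every row whose
# descriptors also occur in another row (mol_b and mol_c here).
no_repeats = df[~df.duplicated(subset=descriptor_cols, keep=False)]

# Rough analogue of CategoryFilter(name="FakeProperty", values=["Wow"], keep=True):
# retain only rows whose property value is in the list.
filtered = no_repeats[no_repeats["FakeProperty"].isin(["Wow"])]
print(filtered.index.tolist())  # ['mol_a']
```

Because `FakeProperty` was drawn at random from two values, roughly half of the compounds survive the category filter, which is consistent with the dataset shrinking from about 4,000 to 2,017 entries.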

## Filling missing features

After feature calculation, some features might have missing values (not relevant for Morgan fingerprints).
By default, these are left as NaN, but you can also specify a value to fill them with.

In [9]:
# calculate compound features and split dataset into train and test
dataset.prepareDataset(
    split=RandomSplit(test_fraction=0.2, dataset=dataset),
    feature_calculators=[MorganFP(radius=3, nBits=2048)],
    feature_fill_value=5,  # fill missing values with 5
    recalculate_features=True,
)

**Note:** This part of the API is still subject to change and will be improved to allow more flexible value imputation schemes.
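
As an illustration of constant-value filling, the following pandas sketch replaces every missing descriptor value with a single number. The descriptor names and values are invented, and the sketch assumes that `feature_fill_value` behaves like a plain constant fill.

```python
import numpy as np
import pandas as pd

# Made-up descriptor table with two missing values.
features = pd.DataFrame(
    {"desc_1": [0.5, np.nan, 1.5], "desc_2": [np.nan, 2.0, 3.0]},
    index=["mol_a", "mol_b", "mol_c"],
)

fill_value = 5  # same constant as in the cell above
filled = features.fillna(fill_value)

# Every NaN is now the constant; all other values are untouched.
print(int(filled.isna().sum().sum()))  # 0
```

A constant such as 5 is arbitrary; whether it is a sensible replacement depends on the scale of each descriptor.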

## Feature selection

After feature calculation, you might want to select a subset of features to use for modelling.
There are a number of feature selection methods available in QSPRpred, which can be used by passing a list of feature selection objects to the `feature_filters` argument.

You can find an overview of available feature selection methods in the [documentation](https://cddleiden.github.io/QSPRpred/docs/api/qsprpred.data.utils.html#module-qsprpred.data.utils.featurefilters).


Here, we will use the `HighCorrelationFilter` to remove features that are highly correlated with each other.

In [10]:
dataset.getFeatures(concat=True).shape

(2017, 2048)

In [11]:
from qsprpred.data.processing.feature_filters import HighCorrelationFilter

# calculate compound features and split dataset into train and test
dataset.prepareDataset(
    split=RandomSplit(test_fraction=0.2, dataset=dataset),
    feature_calculators=[MorganFP(radius=3, nBits=2048)],
    recalculate_features=True,
    feature_filters=[HighCorrelationFilter(th=0.95)]
    # remove features with correlation > 0.95
)

print(f"Number of fingerprint bits after filtering: {dataset.getFeatures(concat=True).shape[1]}")

Number of fingerprint bits after filtering: 2042



In [12]:
dataset.getFeatures(concat=True).shape

(2017, 2042)
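
The idea behind a correlation-based feature filter can be sketched in a few lines of pandas. This is a generic version under stated assumptions, not the code of `HighCorrelationFilter`: the helper `drop_highly_correlated` is hypothetical, it uses the absolute Pearson correlation, and it always drops the later column of a correlated pair, which may differ from what QSPRpred does.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(features: pd.DataFrame, th: float = 0.95) -> pd.DataFrame:
    """Hypothetical helper: drop the later column of each pair with |r| > th."""
    corr = features.corr().abs()
    # Keep only the upper triangle (without the diagonal) so that each
    # pair of columns is inspected exactly once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > th).any()]
    return features.drop(columns=to_drop)


# Made-up bits: bit_1 copies bit_0, bit_3 is constant.
features = pd.DataFrame(
    {
        "bit_0": [1, 0, 1, 0, 1, 0],
        "bit_1": [1, 0, 1, 0, 1, 0],
        "bit_2": [0, 0, 1, 1, 0, 1],
        "bit_3": [0, 0, 0, 0, 0, 0],
    }
)

reduced = drop_highly_correlated(features, th=0.95)
print(list(reduced.columns))  # ['bit_0', 'bit_2', 'bit_3']
```

Note that a constant column has zero standard deviation, so its correlation with any other column is undefined (NaN). In this sketch such a column is never flagged, because a comparison with NaN is always false.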

## Feature standardization

You can also specify scikit-learn feature standardization methods to apply to the features.
In this example we will use the scikit-learn `StandardScaler` to standardize the features.
Note that this is not useful for Morgan fingerprints, but we will use it here for demonstration purposes.

In [13]:
from qsprpred.data.processing.feature_standardizers import SKLearnStandardizer
from sklearn.preprocessing import StandardScaler

# calculate compound features and split dataset into train and test
dataset.prepareDataset(
    split=RandomSplit(test_fraction=0.2, dataset=dataset),
    feature_calculators=[MorganFP(radius=3, nBits=2048)],
    recalculate_features=True,
    feature_standardizer=SKLearnStandardizer(StandardScaler())  # standardize features
)
dataset.getFeatures(concat=True)  # no longer binary

ID,MorganFP_MorganFP_0,MorganFP_MorganFP_1,MorganFP_MorganFP_2,MorganFP_MorganFP_3,MorganFP_MorganFP_4,MorganFP_MorganFP_5,MorganFP_MorganFP_6,MorganFP_MorganFP_7,MorganFP_MorganFP_8,MorganFP_MorganFP_9,...,MorganFP_MorganFP_2038,MorganFP_MorganFP_2039,MorganFP_MorganFP_2040,MorganFP_MorganFP_2041,MorganFP_MorganFP_2042,MorganFP_MorganFP_2043,MorganFP_MorganFP_2044,MorganFP_MorganFP_2045,MorganFP_MorganFP_2046,MorganFP_MorganFP_2047
PreparationTutorialDataset_storage_library_1467,-0.070587,-0.296510,-0.132887,-0.070587,-0.218614,-0.077363,-0.122874,-0.608477,-0.140417,-0.173588,...,-0.600067,-0.114275,-0.107399,-0.112028,-0.116481,-0.126968,-0.196520,-0.159435,-0.225983,-0.182481
PreparationTutorialDataset_storage_library_2718,-0.070587,-0.296510,-0.132887,-0.070587,4.574266,-0.077363,-0.122874,-0.608477,-0.140417,-0.173588,...,-0.600067,-0.114275,-0.107399,-0.112028,-0.116481,-0.126968,5.088549,-0.159435,-0.225983,-0.182481
PreparationTutorialDataset_storage_library_0462,-0.070587,-0.296510,-0.132887,-0.070587,-0.218614,-0.077363,-0.122874,-0.608477,-0.140417,-0.173588,...,-0.600067,-0.114275,-0.107399,-0.112028,-0.116481,-0.126968,-0.196520,-0.159435,-0.225983,-0.182481
PreparationTutorialDataset_storage_library_3312,-0.070587,-0.296510,-0.132887,-0.070587,-0.218614,-0.077363,-0.122874,-0.608477,-0.140417,-0.173588,...,-0.600067,-0.114275,-0.107399,-0.112028,-0.116481,-0.126968,-0.196520,6.272161,-0.225983,-0.182481
PreparationTutorialDataset_storage_library_2874,-0.070587,-0.296510,-0.132887,-0.070587,-0.218614,-0.077363,-0.122874,-0.608477,-0.140417,-0.173588,...,1.666479,-0.114275,-0.107399,-0.112028,-0.116481,-0.126968,-0.196520,-0.159435,-0.225983,-0.182481
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
PreparationTutorialDataset_storage_library_2188,-0.070587,-0.296510,-0.132887,-0.070587,-0.218614,-0.077363,-0.122874,-0.608477,-0.140417,-0.173588,...,-0.600067,-0.114275,-0.107399,-0.112028,-0.116481,-0.126968,-0.196520,6.272161,4.425114,-0.182481
PreparationTutorialDataset_storage_library_1362,-0.070587,-0.296510,-0.132887,-0.070587,-0.218614,-0.077363,-0.122874,-0.608477,-0.140417,-0.173588,...,1.666479,-0.114275,-0.107399,-0.112028,-0.116481,-0.126968,-0.196520,-0.159435,-0.225983,-0.182481
PreparationTutorialDataset_storage_library_2951,-0.070587,3.372571,-0.132887,-0.070587,-0.218614,-0.077363,-0.122874,-0.608477,-0.140417,-0.173588,...,1.666479,-0.114275,-0.107399,-0.112028,-0.116481,-0.126968,-0.196520,-0.159435,-0.225983,-0.182481
PreparationTutorialDataset_storage_library_1849,-0.070587,3.372571,-0.132887,-0.070587,-0.218614,-0.077363,-0.122874,-0.608477,-0.140417,-0.173588,...,1.666479,-0.114275,-0.107399,-0.112028,-0.116481,-0.126968,-0.196520,-0.159435,-0.225983,-0.182481
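
Each column in the output above takes only two distinct values. A small NumPy sketch shows why: standard scaling subtracts the column mean and divides by the (population) standard deviation, so a binary column is mapped to one negative value for unset bits and one positive value for set bits. The data are made up, and the sketch assumes the scaler is fitted on the training set and then applied to the test set, as is common practice.

```python
import numpy as np

# Made-up binary column: the bit is set in 1 of 5 training compounds.
train_bit = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
test_bit = np.array([1.0, 0.0])

# Fit on the training data only; np.std uses the population formula
# (ddof=0), which matches what scikit-learn's StandardScaler computes.
mean = train_bit.mean()
std = train_bit.std()

# Apply the same shift and scale to both sets.
train_scaled = (train_bit - mean) / std
test_scaled = (test_bit - mean) / std

# Unset bits map to about -0.5 and set bits to about 2.0 here.
print(np.round(train_scaled, 3))
print(np.round(test_scaled, 3))
```

The rarer a bit, the closer its "unset" value is to zero and the larger its "set" value, which matches the pattern of small negative and large positive entries in the table.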


Now that you know how to use the different data preparation options, you can start preparing your own datasets.
If you want to collect open-source data for your own project, we recommend you check out the [data collection tutorial](data_collection_with_papyrus.ipynb). This tutorial covers how to easily collect data from [Papyrus](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00672-x), a large-scale curated dataset aimed at bioactivity predictions. Of course, you can also use your own data, or data from other sources.

If you have finished preparing your dataset and are ready to start modelling, you can check out the [model assessment tutorial](model_assessment.ipynb) to learn more about how to train models in QSPRpred, or the [classification tutorial](../modelling/classification.ipynb) if you want to learn more about classification modelling in QSPRpred.