# Data Preparation

In this tutorial, you will learn how to prepare data sets with QSPRPred.

## Data Representation (`PandasDataSet`)

QSPRPred represents data as wrapped `pandas.DataFrame` objects with additional methods that support QSPR modeling. The `PandasDataSet` class is the base class of all data sets in QSPRPred. Wrapping a `pandas.DataFrame` is straightforward:

In [1]:
# load a sample data set on Parkinson's disease
import pandas as pd

df = pd.read_table('data/parkinsons_pivot.tsv')
df

Unnamed: 0,SMILES,GABAAalpha,NMDA,P41594,Q13255,Q14643
0,Brc1cc(-c2nc(-c3ncccc3)no2)ccc1,,,6.93,,
1,Brc1ccc2c(c1)-c1ncnn1Cc1c(-c3cccs3)ncn-21,8.400000,,,,
2,Brc1ccc2c(c1)-c1nncn1Cc1c(I)ncn-21,8.110000,,,,
3,Brc1cccc(-c2cc(-c3ccccc3)nnc2)c1,8.013333,,,,
4,Brc1cccc(-c2cnnc(NCc3ccccc3)c2)c1,7.505000,,,,
...,...,...,...,...,...,...
6220,c1cnc(COc2nn3c(C4CCC4)nnc3c3c2cccc3)cc1,7.700000,,,,
6221,c1cnc(COc2nn3c(N4CCOCC4)nnc3c3c2C2CCC3CC2)cc1,6.000000,,,,
6222,c1cnc(N2CCc3c(cccc3)C2)cc1,,6.12,,,
6223,c1cnc(Oc2cccc(-n3nnc(-c4ncccc4)n3)c2)cc1,,,5.54,,


In [2]:
from qsprpred.data.data import PandasDataSet

ds = PandasDataSet(df=df, store_dir="data", name="parkinsons")
ds

<qsprpred.data.data.PandasDataSet at 0x7fdf81f85180>

You can query this data set directly for simple information like the number of samples:

In [3]:
len(ds)

6225

You can also list the saved properties/features:

In [4]:
ds.getProperties()

Index(['SMILES', 'GABAAalpha', 'NMDA', 'P41594', 'Q13255', 'Q14643', 'QSPRID'], dtype='object')

You can also do some operations on the data frame, like shuffle it:

In [5]:
ds.shuffle()

or drop some columns:

In [6]:
ds.removeProperty("Q14643")

However, you can always access the underlying data frame if more complex operations are needed:

In [7]:
df = ds.getDF()
df

QSPRID,SMILES,GABAAalpha,NMDA,P41594,Q13255,QSPRID
parkinsons_1877,COC(=O)CCC#Cc1ccc(C2SCC(C(C)(C)C)CS2)cc1,7.77,,,,parkinsons_1877
parkinsons_6133,c1ccc(-c2nnc3c4c(cccc4)c(OCC4CCCCC4)nn23)cc1,6.00,,,,parkinsons_6133
parkinsons_5545,O=C1N(c2ccc(C#Cc3cccnc3)cn2)CC2CCCCN12,,,7.130,,parkinsons_5545
parkinsons_4856,O=C(C=Cc1cc(Cl)ccc1NC(=O)c1ccccn1)c1ccccc1,,,5.770,,parkinsons_4856
parkinsons_4639,NC(CC1CCCCC1CCP(=O)(O)O)C(=O)O,,6.06,,,parkinsons_4639
...,...,...,...,...,...,...
parkinsons_3770,Cc1nc(C#Cc2ccncc2)cs1,,,6.900,,parkinsons_3770
parkinsons_2622,Cc1c(C(=O)Nc2cc(Cl)c(NC(=O)c3ccccc3)cc2)occ1,,,,5.0,parkinsons_2622
parkinsons_4073,Cn1c2c(CCN(C(=O)c3nn4cc(Br)cnc4c3)CC2)c2c1cccc2,,,6.260,,parkinsons_4073
parkinsons_5345,O=C(c1ccc(Nc2ccc(Cl)cc2)nc1)N1CCCC1,,,5.270,,parkinsons_5345


It is always possible to wrap it again:

In [8]:
ds = PandasDataSet(df=df, store_dir="data", name="parkinsons")
ds

<qsprpred.data.data.PandasDataSet at 0x7fdeb5f5d570>

In [9]:
ds.getDF()

QSPRID,SMILES,GABAAalpha,NMDA,P41594,Q13255,QSPRID
parkinsons_0,COC(=O)CCC#Cc1ccc(C2SCC(C(C)(C)C)CS2)cc1,7.77,,,,parkinsons_0
parkinsons_1,c1ccc(-c2nnc3c4c(cccc4)c(OCC4CCCCC4)nn23)cc1,6.00,,,,parkinsons_1
parkinsons_2,O=C1N(c2ccc(C#Cc3cccnc3)cn2)CC2CCCCN12,,,7.130,,parkinsons_2
parkinsons_3,O=C(C=Cc1cc(Cl)ccc1NC(=O)c1ccccn1)c1ccccc1,,,5.770,,parkinsons_3
parkinsons_4,NC(CC1CCCCC1CCP(=O)(O)O)C(=O)O,,6.06,,,parkinsons_4
...,...,...,...,...,...,...
parkinsons_6220,Cc1nc(C#Cc2ccncc2)cs1,,,6.900,,parkinsons_6220
parkinsons_6221,Cc1c(C(=O)Nc2cc(Cl)c(NC(=O)c3ccccc3)cc2)occ1,,,,5.0,parkinsons_6221
parkinsons_6222,Cn1c2c(CCN(C(=O)c3nn4cc(Br)cnc4c3)CC2)c2c1cccc2,,,6.260,,parkinsons_6222
parkinsons_6223,O=C(c1ccc(Nc2ccc(Cl)cc2)nc1)N1CCCC1,,,5.270,,parkinsons_6223


### Data Indexing

You might have noticed that after the data set was recreated from the data frame, the "QSPRID" values were regenerated while the rows stayed in the order produced by shuffling. This is because the index is reset automatically whenever a new `PandasDataSet` object is created. However, you can set the index to a specific column when creating the data set:

In [10]:
ds.shuffle()
df = ds.getDF()
df

QSPRID,SMILES,GABAAalpha,NMDA,P41594,Q13255,QSPRID
parkinsons_1579,O=C(O)C1CC(CP(=O)(O)O)CCN1,,6.535,,,parkinsons_1579
parkinsons_1504,N#Cc1cccc(-c2cc(-c3c(F)cc(F)cc3)nnc2)c1,7.705000,,,,parkinsons_1504
parkinsons_2040,O=C(Nc1cc(-c2ccccc2)nn1-c1ccccc1)c1c(O)cccc1,,,5.000,,parkinsons_2040
parkinsons_4409,CC(C)(C)c1nnc2CC(c3nc(-c4ccc(F)cc4)no3)CCn21,,,7.282,,parkinsons_4409
parkinsons_1090,Cc1c(C2=CCC(C(=O)N(C)C(C)C)CC2)nnn1-c1c(F)nccc1,,,,8.22,parkinsons_1090
...,...,...,...,...,...,...
parkinsons_3734,CCC(=O)N1CCc2c(C1)c1nc(-c3c(F)cccc3)cn1c(O)n2,6.973333,,,,parkinsons_3734
parkinsons_2008,Oc1ccc2[nH]c(-c3ccccc3)nc2c1CN1CCCCCC1,,6.200,,,parkinsons_2008
parkinsons_2984,CC(C)(C)c1cn(C2=NCC(=O)N3CCc4c(cccc4C4CCCO4)C3...,,,6.290,5.00,parkinsons_2984
parkinsons_525,Cc1cccc(-c2noc(C3CCN3C(=O)c3ccc(C#N)cc3)n2)c1,,,5.850,,parkinsons_525


In [11]:
ds = PandasDataSet(df=df, store_dir="data", name="parkinsons", index_cols=["QSPRID"])
ds.getDF()

QSPRID,SMILES,GABAAalpha,NMDA,P41594,Q13255,QSPRID
parkinsons_1579,O=C(O)C1CC(CP(=O)(O)O)CCN1,,6.535,,,parkinsons_1579
parkinsons_1504,N#Cc1cccc(-c2cc(-c3c(F)cc(F)cc3)nnc2)c1,7.705000,,,,parkinsons_1504
parkinsons_2040,O=C(Nc1cc(-c2ccccc2)nn1-c1ccccc1)c1c(O)cccc1,,,5.000,,parkinsons_2040
parkinsons_4409,CC(C)(C)c1nnc2CC(c3nc(-c4ccc(F)cc4)no3)CCn21,,,7.282,,parkinsons_4409
parkinsons_1090,Cc1c(C2=CCC(C(=O)N(C)C(C)C)CC2)nnn1-c1c(F)nccc1,,,,8.22,parkinsons_1090
...,...,...,...,...,...,...
parkinsons_3734,CCC(=O)N1CCc2c(C1)c1nc(-c3c(F)cccc3)cn1c(O)n2,6.973333,,,,parkinsons_3734
parkinsons_2008,Oc1ccc2[nH]c(-c3ccccc3)nc2c1CN1CCCCCC1,,6.200,,,parkinsons_2008
parkinsons_2984,CC(C)(C)c1cn(C2=NCC(=O)N3CCc4c(cccc4C4CCCO4)C3...,,,6.290,5.00,parkinsons_2984
parkinsons_525,Cc1cccc(-c2noc(C3CCN3C(=O)c3ccc(C#N)cc3)n2)c1,,,5.850,,parkinsons_525
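
The difference between the two behaviours can be reproduced with plain pandas. The sketch below is not QSPRPred's implementation; the helper `wrap` and its `index_cols` argument are hypothetical stand-ins that only mimic what the text describes: generating fresh `<name>_<row number>` identifiers by default, or reusing an existing column as the index.

```python
import pandas as pd


def wrap(df, name, index_cols=None):
    """Hypothetical stand-in for the wrapping step (not the QSPRPred code)."""
    df = df.copy()
    if index_cols is None:
        # No index requested: generate fresh identifiers from the row position.
        df["QSPRID"] = [f"{name}_{i}" for i in range(len(df))]
        index_cols = ["QSPRID"]
    # drop=False keeps the identifier available as a regular column, too.
    return df.set_index(index_cols, drop=False)


raw = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1", "CC(=O)O"]})
first = wrap(raw, "demo")

# Shuffle the rows; the identifiers travel with their rows.
shuffled = first.sample(frac=1, random_state=42)

# Re-wrapping without index_cols renumbers the rows in their current order ...
renumbered = wrap(shuffled.reset_index(drop=True), "demo")
# ... while passing index_cols keeps the original identifiers.
preserved = wrap(shuffled.reset_index(drop=True), "demo", index_cols=["QSPRID"])

print(renumbered.index.tolist())
print(preserved.index.tolist())
```

Both tables hold the rows in the same shuffled order; only the second one still carries the identifiers that were assigned before shuffling.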


Being aware of the index will help you track compounds and their associated data further down the line. You can also change the index to a custom column at any time:

In [12]:
ds.setIndex(["SMILES"])
ds.getDF().index

Index(['O=C(O)C1CC(CP(=O)(O)O)CCN1', 'N#Cc1cccc(-c2cc(-c3c(F)cc(F)cc3)nnc2)c1',
       'O=C(Nc1cc(-c2ccccc2)nn1-c1ccccc1)c1c(O)cccc1',
       'CC(C)(C)c1nnc2CC(c3nc(-c4ccc(F)cc4)no3)CCn21',
       'Cc1c(C2=CCC(C(=O)N(C)C(C)C)CC2)nnn1-c1c(F)nccc1',
       'Cc1noc(COc2nn3c(-c4ccccc4)nnc3c3c2C2CCC3CC2)n1',
       'CCCCc1nn2c(=O)cc(CN(CC)c3ccccc3)nc2s1',
       'O=C1Nc2c(cc(C#CCCN3CCC(Cc4ccccc4)CC3)cc2)C1=O',
       'Cc1ccc(NC(=O)c2nn(C)c(-c3ccc(F)cc3)c2C)nc1',
       'CCCc1c2Cn3ncnc3-c3c(ccc(Cl)c3)-n2cn1',
       ...
       'CCOc1cc(-c2ccccc2C#N)cnc1Nc1cccc(C)n1',
       'COc1ccc(C#Cc2ccc3C(=O)NCCc3c2)cc1',
       'Cc1c(COCc2ccc(C(N)=O)cn2)c(-c2ccc(F)cn2)no1',
       'Cc1ccc(-c2cccc3c2CCN2C(=O)CN=C(n4cnc(C5CC5)c4)C=C32)nc1',
       'Cc1noc(-c2nnc3c4c(c(OCc5nc(C)c(C)cc5)nn23)C2CCC4CC2)n1',
       'CCC(=O)N1CCc2c(C1)c1nc(-c3c(F)cccc3)cn1c(O)n2',
       'Oc1ccc2[nH]c(-c3ccccc3)nc2c1CN1CCCCCC1',
       'CC(C)(C)c1cn(C2=NCC(=O)N3CCc4c(cccc4C4CCCO4)C3=C2)cn1',
       'Cc1cccc(-c2noc(C3CCN3C(=O)

or even use multiple columns as index:

In [13]:
ds.setIndex(["SMILES", "QSPRID"])
ds.getDF().index

MultiIndex([(                             'O=C(O)C1CC(CP(=O)(O)O)CCN1', ...),
            (                'N#Cc1cccc(-c2cc(-c3c(F)cc(F)cc3)nnc2)c1', ...),
            (           'O=C(Nc1cc(-c2ccccc2)nn1-c1ccccc1)c1c(O)cccc1', ...),
            (           'CC(C)(C)c1nnc2CC(c3nc(-c4ccc(F)cc4)no3)CCn21', ...),
            (        'Cc1c(C2=CCC(C(=O)N(C)C(C)C)CC2)nnn1-c1c(F)nccc1', ...),
            (         'Cc1noc(COc2nn3c(-c4ccccc4)nnc3c3c2C2CCC3CC2)n1', ...),
            (                  'CCCCc1nn2c(=O)cc(CN(CC)c3ccccc3)nc2s1', ...),
            (          'O=C1Nc2c(cc(C#CCCN3CCC(Cc4ccccc4)CC3)cc2)C1=O', ...),
            (             'Cc1ccc(NC(=O)c2nn(C)c(-c3ccc(F)cc3)c2C)nc1', ...),
            (                   'CCCc1c2Cn3ncnc3-c3c(ccc(Cl)c3)-n2cn1', ...),
            ...
            (                  'CCOc1cc(-c2ccccc2C#N)cnc1Nc1cccc(C)n1', ...),
            (                      'COc1ccc(C#Cc2ccc3C(=O)NCCc3c2)cc1', ...),
            (            'Cc1c(COCc2ccc(C(N)=O)c
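
For readers less familiar with multi-column indexes, here is a small pandas-only sketch of what such an index looks like and how a row is addressed. The data values are made up, and `set_index` is used directly rather than the `setIndex` method shown above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "SMILES": ["CCO", "c1ccccc1", "CC(=O)O"],
        "QSPRID": ["demo_0", "demo_1", "demo_2"],
        "NMDA": [6.1, 5.0, 7.2],
    }
)

# drop=False keeps both columns in the table as well as in the index.
indexed = df.set_index(["SMILES", "QSPRID"], drop=False)

# Rows are addressed by a tuple with one entry per index level.
row = indexed.loc[("c1ccccc1", "demo_1")]
print(indexed.index.names)
print(row["NMDA"])
```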

### Parallelization

The `PandasDataSet` class also provides a simple way to parallelize operations on the data frame. For example, you can easily apply a function to all rows of the data frame in parallel. All you need to do is initialize the `PandasDataSet` object with the number of CPUs you want to use:

In [14]:
ds = PandasDataSet(df=df, store_dir="data", name="parkinsons", index_cols=["QSPRID"], n_jobs=2)
df = ds.getDF()
df

QSPRID,SMILES,GABAAalpha,NMDA,P41594,Q13255,QSPRID
parkinsons_1579,O=C(O)C1CC(CP(=O)(O)O)CCN1,,6.535,,,parkinsons_1579
parkinsons_1504,N#Cc1cccc(-c2cc(-c3c(F)cc(F)cc3)nnc2)c1,7.705000,,,,parkinsons_1504
parkinsons_2040,O=C(Nc1cc(-c2ccccc2)nn1-c1ccccc1)c1c(O)cccc1,,,5.000,,parkinsons_2040
parkinsons_4409,CC(C)(C)c1nnc2CC(c3nc(-c4ccc(F)cc4)no3)CCn21,,,7.282,,parkinsons_4409
parkinsons_1090,Cc1c(C2=CCC(C(=O)N(C)C(C)C)CC2)nnn1-c1c(F)nccc1,,,,8.22,parkinsons_1090
...,...,...,...,...,...,...
parkinsons_3734,CCC(=O)N1CCc2c(C1)c1nc(-c3c(F)cccc3)cn1c(O)n2,6.973333,,,,parkinsons_3734
parkinsons_2008,Oc1ccc2[nH]c(-c3ccccc3)nc2c1CN1CCCCCC1,,6.200,,,parkinsons_2008
parkinsons_2984,CC(C)(C)c1cn(C2=NCC(=O)N3CCc4c(cccc4C4CCCO4)C3...,,,6.290,5.00,parkinsons_2984
parkinsons_525,Cc1cccc(-c2noc(C3CCN3C(=O)c3ccc(C#N)cc3)n2)c1,,,5.850,,parkinsons_525


From now on, every operation that supports parallelization will use the specified number of CPUs. One example is the `apply` method, which applies a function to the rows of the data frame (`axis=1`), optionally restricted to a subset of columns:

In [15]:
import time

def add_one_slow(x):
    """ Emulate a slow function."""

    time.sleep(0.01)

    return f"{x[1]}+{x[0]}"

now = time.perf_counter()
ds.apply(add_one_slow, subset=["SMILES", "QSPRID"], axis=1)
print(time.perf_counter() - now)

Parallel apply in progress for parkinsons.:   0%|          | 0/4 [00:00<?, ?it/s]

33.12629931899937


Compare this with the time it takes to do the same thing with only one CPU:

In [16]:
ds.nJobs = 1
now = time.perf_counter()
ds.apply(add_one_slow, subset=["SMILES", "QSPRID"], axis=1)
print(time.perf_counter() - now)

63.852359679000074
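
The roughly twofold difference is what one would expect when the rows are split into chunks that are processed by two workers. The sketch below illustrates that idea with the standard library only; it is not how QSPRPred implements `apply`, and `slow_join`, `apply_chunk`, and `parallel_apply` are hypothetical names. Threads suffice here because `time.sleep` releases the interpreter lock; a CPU-bound function would need processes instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_join(row):
    """Emulate a slow row-wise function."""
    time.sleep(0.01)
    return f"{row[1]}+{row[0]}"


def apply_chunk(chunk):
    """Apply the function to every row of one chunk."""
    return [slow_join(row) for row in chunk]


def parallel_apply(rows, n_jobs):
    """Split rows into n_jobs contiguous chunks and process them concurrently."""
    size = -(-len(rows) // n_jobs)  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map() preserves chunk order, so results line up with the input rows.
        parts = list(pool.map(apply_chunk, chunks))
    return [item for part in parts for item in part]


rows = [(f"SMILES_{i}", f"demo_{i}") for i in range(100)]

start = time.perf_counter()
serial = parallel_apply(rows, n_jobs=1)
serial_time = time.perf_counter() - start

start = time.perf_counter()
parallel = parallel_apply(rows, n_jobs=2)
parallel_time = time.perf_counter() - start

print(f"1 worker: {serial_time:.2f} s, 2 workers: {parallel_time:.2f} s")
```

The results are identical in both runs; only the elapsed time changes, and the exact ratio depends on the machine and on scheduling overhead.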


### Saving and Loading

The `PandasDataSet` class also provides a simple way to save and load data sets. You can save the data set and any associated data to a directory. By default this is the current directory, but you can specify a different directory upon creation of the data set:

In [17]:
ds = PandasDataSet(df=df, store_dir="data", name="parkinsons_pandas", index_cols=["QSPRID"])
ds.save()

'data/parkinsons_pandas_df.pkl'

Reloading the data set is easy as well. We just use its name to initialize a new `PandasDataSet` object:

In [18]:
ds = PandasDataSet(name="parkinsons_pandas", store_dir="data")
ds.getDF()

QSPRID,SMILES,GABAAalpha,NMDA,P41594,Q13255,QSPRID
parkinsons_1579,O=C(O)C1CC(CP(=O)(O)O)CCN1,,6.535,,,parkinsons_1579
parkinsons_1504,N#Cc1cccc(-c2cc(-c3c(F)cc(F)cc3)nnc2)c1,7.705000,,,,parkinsons_1504
parkinsons_2040,O=C(Nc1cc(-c2ccccc2)nn1-c1ccccc1)c1c(O)cccc1,,,5.000,,parkinsons_2040
parkinsons_4409,CC(C)(C)c1nnc2CC(c3nc(-c4ccc(F)cc4)no3)CCn21,,,7.282,,parkinsons_4409
parkinsons_1090,Cc1c(C2=CCC(C(=O)N(C)C(C)C)CC2)nnn1-c1c(F)nccc1,,,,8.22,parkinsons_1090
...,...,...,...,...,...,...
parkinsons_3734,CCC(=O)N1CCc2c(C1)c1nc(-c3c(F)cccc3)cn1c(O)n2,6.973333,,,,parkinsons_3734
parkinsons_2008,Oc1ccc2[nH]c(-c3ccccc3)nc2c1CN1CCCCCC1,,6.200,,,parkinsons_2008
parkinsons_2984,CC(C)(C)c1cn(C2=NCC(=O)N3CCc4c(cccc4C4CCCO4)C3...,,,6.290,5.00,parkinsons_2984
parkinsons_525,Cc1cccc(-c2noc(C3CCN3C(=O)c3ccc(C#N)cc3)n2)c1,,,5.850,,parkinsons_525
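
The path returned by `save()` above suggests that the table is stored as a pickle file named after the data set. As a rough illustration of the name-based round trip, and not of QSPRPred's actual storage code, the sketch below uses plain pandas with a temporary directory; the helpers `save_table` and `load_table` and the `<name>_df.pkl` naming are assumptions modelled on that output.

```python
import os
import tempfile

import pandas as pd


def save_table(df, store_dir, name):
    """Hypothetical helper: pickle the table as <store_dir>/<name>_df.pkl."""
    os.makedirs(store_dir, exist_ok=True)
    path = os.path.join(store_dir, f"{name}_df.pkl")
    df.to_pickle(path)
    return path


def load_table(store_dir, name):
    """Hypothetical helper: find the pickle again using only directory and name."""
    return pd.read_pickle(os.path.join(store_dir, f"{name}_df.pkl"))


table = pd.DataFrame(
    {"SMILES": ["CCO", "c1ccccc1"], "GABAAalpha": [7.7, None]},
    index=pd.Index(["demo_0", "demo_1"], name="QSPRID"),
)

with tempfile.TemporaryDirectory() as store_dir:
    path = save_table(table, store_dir, "demo")
    restored = load_table(store_dir, "demo")

print(os.path.basename(path))
print(restored.equals(table))
```

Because the file name is derived from the data set name, the name and the storage directory are all that is needed to find the data again.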


## Data Representation (`MoleculeTable`)

The next extension of the `PandasDataSet` class is the `MoleculeTable` class. While `PandasDataSet` is a completely general class that can be used for any data set, `MoleculeTable` is specifically designed for data sets that contain molecular structures. It is a subclass of `PandasDataSet` and adds some useful functions for molecules. For example, you can easily convert the SMILES strings to RDKit molecules in your table:

In [19]:
from qsprpred.data.data import MoleculeTable

mt = MoleculeTable(df=df, store_dir="data", name="parkinsons", add_rdkit=True)
mt.getDF()

QSPRID,SMILES,GABAAalpha,NMDA,P41594,Q13255,QSPRID,RDMol
parkinsons_0,O=C(O)C1CC(CP(=O)(O)O)CCN1,,6.535,,,parkinsons_0,
parkinsons_1,N#Cc1cccc(-c2cc(-c3c(F)cc(F)cc3)nnc2)c1,7.705000,,,,parkinsons_1,
parkinsons_2,O=C(Nc1cc(-c2ccccc2)nn1-c1ccccc1)c1c(O)cccc1,,,5.000,,parkinsons_2,
parkinsons_3,CC(C)(C)c1nnc2CC(c3nc(-c4ccc(F)cc4)no3)CCn21,,,7.282,,parkinsons_3,
parkinsons_4,Cc1c(C2=CCC(C(=O)N(C)C(C)C)CC2)nnn1-c1c(F)nccc1,,,,8.22,parkinsons_4,
...,...,...,...,...,...,...,...
parkinsons_6220,CCC(=O)N1CCc2c(C1)c1nc(-c3c(F)cccc3)cn1c(O)n2,6.973333,,,,parkinsons_6220,
parkinsons_6221,Oc1ccc2[nH]c(-c3ccccc3)nc2c1CN1CCCCCC1,,6.200,,,parkinsons_6221,
parkinsons_6222,CC(C)(C)c1cn(C2=NCC(=O)N3CCc4c(cccc4C4CCCO4)C3...,,,6.290,5.00,parkinsons_6222,
parkinsons_6223,Cc1cccc(-c2noc(C3CCN3C(=O)c3ccc(C#N)cc3)n2)c1,,,5.850,,parkinsons_6223,


Besides making it possible to depict structures in Jupyter notebooks, the `MoleculeTable` class can calculate molecular descriptors and fingerprints, determine scaffolds, and standardize structures. We will go over some of these features in the next sections.

### Dropping Invalid Molecules and Standardizing Structures

Before making calculations, it is a good idea to standardize structures and drop invalid molecules. `MoleculeTable` provides a simple way to do this:

In [20]:
mt.standardizeSmiles('chembl', drop_invalid=True)

The code above uses the ChEMBL standardizer to standardize the structures and drops all invalid molecules. You can also do it separately, though:

In [21]:
mt.standardizeSmiles('chembl', drop_invalid=False)
# returns a boolean array of the same length as the unfiltered data frame
# that indicates which rows were dropped (True) and which were kept (False)
mt.dropInvalids()

QSPRID
parkinsons_0       False
parkinsons_1       False
parkinsons_2       False
parkinsons_3       False
parkinsons_4       False
                   ...  
parkinsons_6220    False
parkinsons_6221    False
parkinsons_6222    False
parkinsons_6223    False
parkinsons_6224    False
Name: SMILES, Length: 6225, dtype: bool
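
The returned mask can be used to inspect which entries were removed. The following pandas-only sketch shows the mask convention (True for dropped rows); the validity check `looks_valid` is a deliberately crude, hypothetical stand-in that only tests for balanced parentheses, whereas a real check would parse each SMILES string with a cheminformatics toolkit such as RDKit.

```python
import pandas as pd


def looks_valid(smiles):
    """Toy validity check (assumption): non-empty with balanced parentheses."""
    if not isinstance(smiles, str) or not smiles:
        return False
    depth = 0
    for char in smiles:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


df = pd.DataFrame(
    {"SMILES": ["CC(=O)O", "c1ccccc1", "CC(C", None]},
    index=pd.Index([f"demo_{i}" for i in range(4)], name="QSPRID"),
)

# True marks rows that are dropped, False marks rows that are kept.
dropped = ~df["SMILES"].apply(looks_valid)
cleaned = df[~dropped]

print(dropped)
print(cleaned)
```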

### Calculating Molecular Descriptors

QSPRPred provides an interface to easily calculate molecular descriptors. The package already contains many descriptor implementations, but you can also easily add your own. Here is an example that calculates Morgan fingerprints and RDKit descriptors:

In [22]:
from qsprpred.data.utils.descriptorsets import rdkit_descs, FingerprintSet
from qsprpred.data.utils.descriptorcalculator import MoleculeDescriptorsCalculator

calc = MoleculeDescriptorsCalculator(descsets=[FingerprintSet("MorganFP", radius=3, nBits=2048), rdkit_descs()])
mt.addDescriptors(calc)

**Note:** You can speed these calculations up by setting the `n_jobs` parameter at creation or the `nJobs` attribute afterwards, since descriptor calculation processes multiple sets of molecules in parallel if `mt.nJobs` is higher than 1.

Descriptors are saved in their own wrapped tables, which can be accessed with the `descriptors` attribute:

In [23]:
mt.descriptors

[<qsprpred.data.data.DescriptorTable at 0x7fdfcc3b2560>]

In [24]:
mt.descriptors[0].getDF().shape

(6225, 2257)

Adding more descriptors (e.g. [protein descriptors](data_preparation_advanced.ipynb)) later will append to this list. You can easily get the whole matrix of descriptors as follows:

In [25]:
mt.getDescriptors()

QSPRID,Descriptor_FingerprintSet_MorganFP_0,Descriptor_FingerprintSet_MorganFP_1,Descriptor_FingerprintSet_MorganFP_2,Descriptor_FingerprintSet_MorganFP_3,Descriptor_FingerprintSet_MorganFP_4,Descriptor_FingerprintSet_MorganFP_5,Descriptor_FingerprintSet_MorganFP_6,Descriptor_FingerprintSet_MorganFP_7,Descriptor_FingerprintSet_MorganFP_8,Descriptor_FingerprintSet_MorganFP_9,...,Descriptor_RDkit_fr_sulfide,Descriptor_RDkit_fr_sulfonamd,Descriptor_RDkit_fr_sulfone,Descriptor_RDkit_fr_term_acetylene,Descriptor_RDkit_fr_tetrazole,Descriptor_RDkit_fr_thiazole,Descriptor_RDkit_fr_thiocyan,Descriptor_RDkit_fr_thiophene,Descriptor_RDkit_fr_unbrch_alkane,Descriptor_RDkit_fr_urea
parkinsons_0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
parkinsons_6220,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_6221,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_6222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_6223,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
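
Conceptually, the full matrix is the column-wise combination of all descriptor tables, aligned on the shared index. The sketch below illustrates this with plain pandas; the two small tables and their column names are made up for illustration, and `pd.concat` is used as a stand-in rather than as a description of what `getDescriptors` does internally.

```python
import pandas as pd

index = pd.Index(["demo_0", "demo_1", "demo_2"], name="QSPRID")

# Two hypothetical descriptor tables that share the data set index.
fingerprints = pd.DataFrame(
    [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]],
    index=index,
    columns=["Descriptor_FP_0", "Descriptor_FP_1"],
)
properties = pd.DataFrame(
    {"Descriptor_MolWt": [46.07, 78.11, 60.05]},
    index=index,
)

descriptor_tables = [fingerprints, properties]

# Align on the index and place the tables side by side.
matrix = pd.concat(descriptor_tables, axis=1)
print(matrix.shape)
```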


## Data Representation (`QSPRDataset`)

The `QSPRDataset` class is the next extension of the `PandasDataSet` and `MoleculeTable` classes. It is a subclass of `MoleculeTable` and adds some useful functions for QSPR model training itself. You can create a `QSPRDataset` object from scratch as usual, but this time you will need to specify the target properties and tasks you would like to model. For example, a data set for a simple regression task would be defined as follows:

In [26]:
from qsprpred.models.tasks import TargetTasks
from qsprpred.data.data import QSPRDataset, TargetProperty

ds_qspr = QSPRDataset(df=df, store_dir="data", name="parkinsons", target_props=[{"name": "GABAAalpha", "task": TargetTasks.REGRESSION}])
# or
ds_qspr = QSPRDataset(df=df, store_dir="data", name="parkinsons", target_props=[TargetProperty("GABAAalpha", TargetTasks.REGRESSION)])
ds_qspr.getDF()

QSPRID,SMILES,GABAAalpha,NMDA,P41594,Q13255,QSPRID,RDMol
parkinsons_0,N#Cc1cccc(-c2cnnc(-c3ccc(F)cc3F)c2)c1,7.705000,,,,parkinsons_0,
parkinsons_1,Cc1noc(COc2nn3c(-c4ccccc4)nnc3c3c2C2CCC3CC2)n1,7.087500,,,,parkinsons_1,
parkinsons_2,CCCc1ncn2c1Cn1ncnc1-c1cc(Cl)ccc1-2,8.800000,,,,parkinsons_2,
parkinsons_3,Cc1onc(-c2ccccc2)c1COc1ccc(C(=O)NCCOC(C)C)cn1,7.380000,,,,parkinsons_3,
parkinsons_4,Cc1onc(-c2ccncc2)c1COc1ccc(C(=O)NCC(F)(F)F)cn1,8.230000,,,,parkinsons_4,
...,...,...,...,...,...,...,...
parkinsons_2045,CN1CCCC1Cc1c[nH]c2ccc(-n3ccc4cc(C#N)ccc43)cc12,6.000000,,,,parkinsons_2045,
parkinsons_2046,CC1Cc2c([nH]c3cc(Cl)c(F)cc23)C2(N1)C(=O)Nc1ccc...,5.000000,5.0,,,parkinsons_2046,
parkinsons_2047,Cc1onc(-c2ccc(F)cn2)c1COCc1ccc(C(N)=O)cn1,9.220000,,,,parkinsons_2047,
parkinsons_2048,Cc1noc(-c2nnc3c4c(c(OCc5ccc(C)c(C)n5)nn23)C2CC...,8.200000,,,,parkinsons_2048,


You can see that some rows from the original data frame were dropped automatically because they did not have a value for the specified `GABAAalpha` target property.
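
The effect is comparable to dropping rows with a missing target value in pandas and regenerating the identifiers afterwards. A minimal sketch with made-up values follows; it only mimics the observable behaviour and makes no claim about how `QSPRDataset` implements it.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "SMILES": ["CCO", "c1ccccc1", "CC(=O)O", "CCN"],
        "GABAAalpha": [7.7, None, 8.2, None],
        "NMDA": [None, 6.1, None, 5.0],
    }
)

# Keep only the rows that have a value for the chosen target property.
target = "GABAAalpha"
kept = df.dropna(subset=[target]).reset_index(drop=True)

# Regenerate identifiers for the remaining rows.
kept.index = pd.Index([f"demo_{i}" for i in range(len(kept))], name="QSPRID")
print(kept)
```

Rows that only have values for other properties (here `NMDA`) are removed, which is why the resulting table is shorter than the original one.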

You may also notice that in this data set we are now missing our descriptors:

In [27]:
ds_qspr.descriptors

[]

In [28]:
ds_qspr.getDescriptors()

parkinsons_0
parkinsons_1
parkinsons_2
parkinsons_3
parkinsons_4
...
parkinsons_2045
parkinsons_2046
parkinsons_2047
parkinsons_2048
parkinsons_2049


This is because the `QSPRDataset` class does not know anything about the descriptors in our `MoleculeTable` object since it only uses the original data frame with the molecules and target properties. Therefore, there is also the `fromMolTable` method that allows you to create a `QSPRDataset` object from a `MoleculeTable` object while maintaining all data associated with it:

In [29]:
ds_qspr = QSPRDataset.fromMolTable(mt, target_props=[TargetProperty("GABAAalpha", TargetTasks.REGRESSION)])
ds_qspr.descriptors

[<qsprpred.data.data.DescriptorTable at 0x7fdfcc3b2560>]

In [30]:
ds_qspr.getDescriptors()

QSPRID,Descriptor_FingerprintSet_MorganFP_0,Descriptor_FingerprintSet_MorganFP_1,Descriptor_FingerprintSet_MorganFP_2,Descriptor_FingerprintSet_MorganFP_3,Descriptor_FingerprintSet_MorganFP_4,Descriptor_FingerprintSet_MorganFP_5,Descriptor_FingerprintSet_MorganFP_6,Descriptor_FingerprintSet_MorganFP_7,Descriptor_FingerprintSet_MorganFP_8,Descriptor_FingerprintSet_MorganFP_9,...,Descriptor_RDkit_fr_sulfide,Descriptor_RDkit_fr_sulfonamd,Descriptor_RDkit_fr_sulfone,Descriptor_RDkit_fr_term_acetylene,Descriptor_RDkit_fr_tetrazole,Descriptor_RDkit_fr_thiazole,Descriptor_RDkit_fr_thiocyan,Descriptor_RDkit_fr_thiophene,Descriptor_RDkit_fr_unbrch_alkane,Descriptor_RDkit_fr_urea
parkinsons_0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
parkinsons_2045,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
parkinsons_2046,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_2047,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_2048,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


At this point, this data set is ready to be used for machine learning, which will be addressed in the [following tutorial](tutorial_training.ipynb). More advanced features like interfacing with the [Papyrus data set](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00672-x), calculating protein descriptors or adding your own descriptors are covered in the [Advanced Data Preparation](data_preparation_advanced.ipynb) notebook.