# Data Preparation

In this tutorial, you will learn how to prepare data sets with QSPRPred.

## Data Representation (`PandasDataSet`)

The package uses wrapped `pandas.DataFrame` objects with additional methods that support tasks relevant to QSPR modeling. The `PandasDataSet` class is the base class of all data sets in QSPRPred. Wrapping a `pandas.DataFrame` is easy:

In [1]:
# load a sample data set on Parkinson's disease
import pandas as pd

random_state = 42

df = pd.read_table('data/parkinsons_pivot.tsv')
df

,SMILES,GABAAalpha,NMDA,O00222,O15303,P41594,Q13255,Q14416,Q14643,Q14831,Q14832,Q14833
0,Brc1cc(-c2nc(-c3ncccc3)no2)ccc1,,,,,6.93,,,,,,
1,Brc1ccc2c(c1)-c1ncnn1Cc1c(-c3cccs3)ncn-21,8.400000,,,,,,,,,,
2,Brc1ccc2c(c1)-c1nncn1Cc1c(I)ncn-21,8.110000,,,,,,,,,,
3,Brc1cccc(-c2cc(-c3ccccc3)nnc2)c1,8.013333,,,,,,,,,,
4,Brc1cccc(-c2cnnc(NCc3ccccc3)c2)c1,7.505000,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
8217,c1cnc2c(c1)ncn2-c1cc2c(cc1)[nH]cc2CC1CCCN1,6.960000,,,,,,,,,,
8218,c1csc(-n2cc(-c3ccnc4ccccc34)cn2)c1,,,,,,,,,,,5.370
8219,c1csc(Nc2nc3c(CCCc4c-3cn[nH]4)s2)n1,,,,,,,,,,,5.000
8220,c1n[nH]c2c1-c1c(CCC2)sc(NC2CCCCC2)n1,,,,,,,,,,,5.888


In [2]:
from qsprpred.data.data import PandasDataSet

ds = PandasDataSet(df=df, store_dir="data", name="parkinsons", random_state=random_state)
ds

<qsprpred.data.data.PandasDataSet at 0x7f6fd8f828f0>

You can query this data set directly for simple information like the number of samples:

In [3]:
len(ds)

8222

or the saved properties/features:

In [4]:
ds.getProperties()

Index(['SMILES', 'GABAAalpha', 'NMDA', 'O00222', 'O15303', 'P41594', 'Q13255',
       'Q14416', 'Q14643', 'Q14831', 'Q14832', 'Q14833', 'QSPRID'],
      dtype='object')

You can also perform some operations on the data frame, such as shuffling it:

In [5]:
ds.shuffle()

or drop some columns:

In [6]:
ds.removeProperty("Q14643")

However, you can always access the underlying data frame if more complex operations are needed:

In [7]:
df = ds.getDF()
df

QSPRID,SMILES,GABAAalpha,NMDA,O00222,O15303,P41594,Q13255,Q14416,Q14643,Q14831,Q14832,Q14833,QSPRID
parkinsons_8214,c1cnc(Nc2nc3c(CCc4n[nH]cc4-3)s2)cc1,,,,,,,,,,,6.659,parkinsons_8214
parkinsons_8033,Sc1nccc(C=Cc2ccccc2)n1,,,,,,,,,,,5.110,parkinsons_8033
parkinsons_6447,Nc1c(NCC(=O)O)c(=O)c1=O,,5.640000,,,,,,,,,,parkinsons_6447
parkinsons_5313,Cn1c(CN2CCN(c3nc4c(cccc4)s3)CC2)nc2c1cccc2,,,,,,,6.1,,,,,parkinsons_5313
parkinsons_8121,c1ccc(-n2cc(-c3ccnc4c3ccc(OCCCN3CCOCC3)c4)cn2)cc1,,,,,,,,,,,5.551,parkinsons_8121
...,...,...,...,...,...,...,...,...,...,...,...,...,...
parkinsons_5734,N#Cc1c(C2CC(F)(F)C2)c2c(nc1-c1ccccc1)[nH]nc2-c...,,,,,7.51,,,,,,,parkinsons_5734
parkinsons_5191,Clc1ccc(Nc2c3ncccc3n[nH]2)cc1Cl,,,,,,,,,,,6.950,parkinsons_5191
parkinsons_5390,Cn1ccn2c(-c3ccc(F)c(-c4ccc(F)cc4C#N)c3)cnc2c1=O,9.093333,,,,,,,,,,,parkinsons_5390
parkinsons_860,CC12NC(Cc3c1cccc3)c1ccccc12,,8.354286,,,,,,,,,,parkinsons_860


It is always possible to wrap it again:

In [8]:
ds = PandasDataSet(df=df, store_dir="data", name="parkinsons", random_state=random_state)
ds

<qsprpred.data.data.PandasDataSet at 0x7f6f85ee7fa0>

In [9]:
ds.getDF()

QSPRID,SMILES,GABAAalpha,NMDA,O00222,O15303,P41594,Q13255,Q14416,Q14643,Q14831,Q14832,Q14833,QSPRID
parkinsons_0,c1cnc(Nc2nc3c(CCc4n[nH]cc4-3)s2)cc1,,,,,,,,,,,6.659,parkinsons_0
parkinsons_1,Sc1nccc(C=Cc2ccccc2)n1,,,,,,,,,,,5.110,parkinsons_1
parkinsons_2,Nc1c(NCC(=O)O)c(=O)c1=O,,5.640000,,,,,,,,,,parkinsons_2
parkinsons_3,Cn1c(CN2CCN(c3nc4c(cccc4)s3)CC2)nc2c1cccc2,,,,,,,6.1,,,,,parkinsons_3
parkinsons_4,c1ccc(-n2cc(-c3ccnc4c3ccc(OCCCN3CCOCC3)c4)cn2)cc1,,,,,,,,,,,5.551,parkinsons_4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
parkinsons_8217,N#Cc1c(C2CC(F)(F)C2)c2c(nc1-c1ccccc1)[nH]nc2-c...,,,,,7.51,,,,,,,parkinsons_8217
parkinsons_8218,Clc1ccc(Nc2c3ncccc3n[nH]2)cc1Cl,,,,,,,,,,,6.950,parkinsons_8218
parkinsons_8219,Cn1ccn2c(-c3ccc(F)c(-c4ccc(F)cc4C#N)c3)cnc2c1=O,9.093333,,,,,,,,,,,parkinsons_8219
parkinsons_8220,CC12NC(Cc3c1cccc3)c1ccccc12,,8.354286,,,,,,,,,,parkinsons_8220
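
To make the idea of a wrapped data frame more concrete, here is a minimal sketch in plain pandas. The `TinyDataSet` class and all of its internals are hypothetical and are not QSPRPred's implementation; only the method names and the `name_i` pattern of the "QSPRID" values mirror what the cells above show.

```python
import pandas as pd


class TinyDataSet:
    """Hypothetical stand-in for a DataFrame wrapper; not the QSPRPred class."""

    def __init__(self, df, name, random_state=None):
        self.name = name
        self.random_state = random_state
        # Copy so the caller's frame is left untouched.
        self.df = df.copy()
        # Assign fresh, position-based identifiers and use them as the index.
        self.df["QSPRID"] = [f"{name}_{i}" for i in range(len(self.df))]
        self.df = self.df.set_index("QSPRID", drop=False)

    def __len__(self):
        return len(self.df)

    def getProperties(self):
        return self.df.columns

    def shuffle(self):
        # Sampling the full frame without replacement permutes the rows.
        self.df = self.df.sample(frac=1.0, random_state=self.random_state)

    def getDF(self):
        return self.df


raw = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1", "CC(=O)O"], "pX": [5.1, 6.3, 7.0]})
ds_toy = TinyDataSet(raw, name="toy", random_state=42)
ds_toy.shuffle()
shuffled = ds_toy.getDF()

# Wrapping the shuffled frame again keeps the row order but renumbers the IDs.
rewrapped = TinyDataSet(shuffled, name="toy").getDF()
print(list(rewrapped["QSPRID"]))
```

In this sketch the identifiers are regenerated on every construction, which is one simple way to obtain the behaviour seen above, where the rows stay in shuffled order while the IDs start again from zero.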


### Data Indexing

You might have noticed that when the data set was recreated from the new data frame, the "QSPRID" column was reset, while the rows remained in the same order as after shuffling. This is because the index is automatically reset when a new `PandasDataSet` object is created. However, you can set the index to a specific column when creating the data set:

In [10]:
ds.shuffle()
df = ds.getDF()
df

QSPRID,SMILES,GABAAalpha,NMDA,O00222,O15303,P41594,Q13255,Q14416,Q14643,Q14831,Q14832,Q14833,QSPRID
parkinsons_8214,Cc1ccc2nc(-n3c(O)nc4ccccc43)oc2c1,6.52,,,,,,,,,,,parkinsons_8214
parkinsons_8033,Cc1cn2cc(C(F)(F)F)cc2c(C#Cc2ccc(C#N)cc2)n1,,,,,4.80,,,,,,,parkinsons_8033
parkinsons_6447,c1ccc(-n2cc(-c3ccnc4c3ccc(OCCCN3CCCCC3)c4)cn2)cc1,,,,,,,,,,,5.0,parkinsons_6447
parkinsons_5313,O=C1c2cc(COc3ccccc3)nn2CCN1c1c(F)cc(F)cc1,,,,,7.00,,,,,,,parkinsons_5313
parkinsons_8121,Cc1nc(C)c(Oc2c(Cl)cc(-c3c(Cl)c4nnc(CC5CC5)n4cc...,,,,,,,8.377,,,,,parkinsons_8121
...,...,...,...,...,...,...,...,...,...,...,...,...,...
parkinsons_5734,Nc1c(NCCCC(=O)O)c(=O)c1=O,,5.0,,,,,,,,,,parkinsons_5734
parkinsons_5191,COc1c(Cl)cc2nc(C3N=CC=N3)[nH]c2c1,,,,,,,,,,,5.0,parkinsons_5191
parkinsons_5390,CN1CCN(Cc2ccc3c(-c4ccc(F)cc4)cc(C(N)=O)nc3c2)CC1,,,,,,,6.600,,,,,parkinsons_5390
parkinsons_860,CC1(C)CCN(c2ccc(C#Cc3cncc(Cl)c3)cn2)C(=O)O1,,,,,7.57,,,,,,,parkinsons_860


In [11]:
ds = PandasDataSet(df=df, store_dir="data", name="parkinsons", index_cols=["QSPRID"], random_state=random_state)
ds.getDF()

QSPRID,SMILES,GABAAalpha,NMDA,O00222,O15303,P41594,Q13255,Q14416,Q14643,Q14831,Q14832,Q14833,QSPRID
parkinsons_8214,Cc1ccc2nc(-n3c(O)nc4ccccc43)oc2c1,6.52,,,,,,,,,,,parkinsons_8214
parkinsons_8033,Cc1cn2cc(C(F)(F)F)cc2c(C#Cc2ccc(C#N)cc2)n1,,,,,4.80,,,,,,,parkinsons_8033
parkinsons_6447,c1ccc(-n2cc(-c3ccnc4c3ccc(OCCCN3CCCCC3)c4)cn2)cc1,,,,,,,,,,,5.0,parkinsons_6447
parkinsons_5313,O=C1c2cc(COc3ccccc3)nn2CCN1c1c(F)cc(F)cc1,,,,,7.00,,,,,,,parkinsons_5313
parkinsons_8121,Cc1nc(C)c(Oc2c(Cl)cc(-c3c(Cl)c4nnc(CC5CC5)n4cc...,,,,,,,8.377,,,,,parkinsons_8121
...,...,...,...,...,...,...,...,...,...,...,...,...,...
parkinsons_5734,Nc1c(NCCCC(=O)O)c(=O)c1=O,,5.0,,,,,,,,,,parkinsons_5734
parkinsons_5191,COc1c(Cl)cc2nc(C3N=CC=N3)[nH]c2c1,,,,,,,,,,,5.0,parkinsons_5191
parkinsons_5390,CN1CCN(Cc2ccc3c(-c4ccc(F)cc4)cc(C(N)=O)nc3c2)CC1,,,,,,,6.600,,,,,parkinsons_5390
parkinsons_860,CC1(C)CCN(c2ccc(C#Cc3cncc(Cl)c3)cn2)C(=O)O1,,,,,7.57,,,,,,,parkinsons_860


Being aware of the index helps you track compounds and their associated data further down the line, so it pays to get used to it. You can also reset the index to a custom column at any time:

In [12]:
ds.setIndex(["SMILES"])
ds.getDF().index

Index(['Cc1ccc2nc(-n3c(O)nc4ccccc43)oc2c1',
       'Cc1cn2cc(C(F)(F)F)cc2c(C#Cc2ccc(C#N)cc2)n1',
       'c1ccc(-n2cc(-c3ccnc4c3ccc(OCCCN3CCCCC3)c4)cn2)cc1',
       'O=C1c2cc(COc3ccccc3)nn2CCN1c1c(F)cc(F)cc1',
       'Cc1nc(C)c(Oc2c(Cl)cc(-c3c(Cl)c4nnc(CC5CC5)n4cc3)cc2)cc1',
       'COc1ccc(C2CCN(c3c(C#N)c(=O)n(CC4CC4)cc3)CC2)cc1',
       'CCCCCCC1CN(c2ccc(C#N)cc2)C(=O)O1',
       'Fc1c(NC2CC2)ccc(-c2c(Cl)c3nnc(CC4CC4)n3cc2)c1',
       'CCOC(=O)c1ncc2sc3c(cccc3)c2c1',
       'Clc1cc2c(cc1)c(NCCCNCCCOc1cccc3c1c1c(cccc1)[nH]3)c1CCCCc1n2',
       ...
       'OC1CN(CCCOc2ccccc2)CCc2c1ccc(OCc1ccccc1)c2',
       'Cn1ncnc1COc1nn2c(-c3ccccc3F)nncc2c1-c1ccc(F)cc1',
       'O=C(NCc1c(F)cccc1)C(=O)c1c[nH]c2ccccc12',
       'Cc1cc(-c2c3CCOCCn3c(-c3ccc(Cl)c(OC(F)(F)F)c3)n2)cc(C)n1',
       'Cc1cccc2c1nc(-c1ccccc1)[nH]2', 'Nc1c(NCCCC(=O)O)c(=O)c1=O',
       'COc1c(Cl)cc2nc(C3N=CC=N3)[nH]c2c1',
       'CN1CCN(Cc2ccc3c(-c4ccc(F)cc4)cc(C(N)=O)nc3c2)CC1',
       'CC1(C)CCN(c2ccc(C#Cc3cncc(Cl)c3)cn2)C(=O)

or even use multiple columns as index:

In [13]:
ds.setIndex(["SMILES", "QSPRID"])
ds.getDF().index

MultiIndex([(                          'Cc1ccc2nc(-n3c(O)nc4ccccc43)oc2c1', ...),
            (                 'Cc1cn2cc(C(F)(F)F)cc2c(C#Cc2ccc(C#N)cc2)n1', ...),
            (          'c1ccc(-n2cc(-c3ccnc4c3ccc(OCCCN3CCCCC3)c4)cn2)cc1', ...),
            (                  'O=C1c2cc(COc3ccccc3)nn2CCN1c1c(F)cc(F)cc1', ...),
            (    'Cc1nc(C)c(Oc2c(Cl)cc(-c3c(Cl)c4nnc(CC5CC5)n4cc3)cc2)cc1', ...),
            (            'COc1ccc(C2CCN(c3c(C#N)c(=O)n(CC4CC4)cc3)CC2)cc1', ...),
            (                           'CCCCCCC1CN(c2ccc(C#N)cc2)C(=O)O1', ...),
            (              'Fc1c(NC2CC2)ccc(-c2c(Cl)c3nnc(CC4CC4)n3cc2)c1', ...),
            (                              'CCOC(=O)c1ncc2sc3c(cccc3)c2c1', ...),
            ('Clc1cc2c(cc1)c(NCCCNCCCOc1cccc3c1c1c(cccc1)[nH]3)c1CCCCc1n2', ...),
            ...
            (                 'OC1CN(CCCOc2ccccc2)CCc2c1ccc(OCc1ccccc1)c2', ...),
            (            'Cn1ncnc1COc1nn2c(-c3ccccc3F)nncc2c1-c1ccc(F)cc1', ...),
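
For readers less familiar with pandas indexing, the following sketch shows the same three situations on a toy data frame using only `DataFrame.set_index`. It illustrates the underlying pandas behaviour and makes no claim about how `setIndex` is implemented in QSPRPred; the toy values are made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "SMILES": ["CCO", "c1ccccc1", "CC(=O)O"],
        "QSPRID": ["toy_2", "toy_0", "toy_1"],
        "pX": [5.1, 6.3, 7.0],
    }
)

# Use an existing identifier column as the index; drop=False keeps it as a column too.
by_id = toy.set_index("QSPRID", drop=False)

# A different column can serve as the index just as well.
by_smiles = toy.set_index("SMILES", drop=False)

# Several columns together produce a MultiIndex.
by_both = toy.set_index(["SMILES", "QSPRID"], drop=False)

# Label-based lookup now goes through the chosen index.
print(by_id.loc["toy_0", "SMILES"])
print(type(by_both.index).__name__, list(by_both.index.names))
```

Keeping the identifier both as the index and as a regular column, as in the outputs above, means the IDs survive operations that discard the index.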


### Parallelization

The `PandasDataSet` class also provides a simple way to parallelize operations on the data frame. For example, you can easily apply a function to all rows of the data frame in parallel. All you need to do is initialize the `PandasDataSet` object with the number of CPUs you want to use:

In [14]:
ds = PandasDataSet(df=df, store_dir="data", name="parkinsons", index_cols=["QSPRID"], n_jobs=2, random_state=random_state)
df = ds.getDF()
df

QSPRID,SMILES,GABAAalpha,NMDA,O00222,O15303,P41594,Q13255,Q14416,Q14643,Q14831,Q14832,Q14833,QSPRID
parkinsons_8214,Cc1ccc2nc(-n3c(O)nc4ccccc43)oc2c1,6.52,,,,,,,,,,,parkinsons_8214
parkinsons_8033,Cc1cn2cc(C(F)(F)F)cc2c(C#Cc2ccc(C#N)cc2)n1,,,,,4.80,,,,,,,parkinsons_8033
parkinsons_6447,c1ccc(-n2cc(-c3ccnc4c3ccc(OCCCN3CCCCC3)c4)cn2)cc1,,,,,,,,,,,5.0,parkinsons_6447
parkinsons_5313,O=C1c2cc(COc3ccccc3)nn2CCN1c1c(F)cc(F)cc1,,,,,7.00,,,,,,,parkinsons_5313
parkinsons_8121,Cc1nc(C)c(Oc2c(Cl)cc(-c3c(Cl)c4nnc(CC5CC5)n4cc...,,,,,,,8.377,,,,,parkinsons_8121
...,...,...,...,...,...,...,...,...,...,...,...,...,...
parkinsons_5734,Nc1c(NCCCC(=O)O)c(=O)c1=O,,5.0,,,,,,,,,,parkinsons_5734
parkinsons_5191,COc1c(Cl)cc2nc(C3N=CC=N3)[nH]c2c1,,,,,,,,,,,5.0,parkinsons_5191
parkinsons_5390,CN1CCN(Cc2ccc3c(-c4ccc(F)cc4)cc(C(N)=O)nc3c2)CC1,,,,,,,6.600,,,,,parkinsons_5390
parkinsons_860,CC1(C)CCN(c2ccc(C#Cc3cncc(Cl)c3)cn2)C(=O)O1,,,,,7.57,,,,,,,parkinsons_860


Now, all operations that support parallelization will run on two CPUs. One example is the `apply` method, which applies a function row by row (`axis=1`) to a subset of columns of the data frame:

In [15]:
import time

def add_one_slow(x):
    """ Emulate a slow function."""

    time.sleep(0.001) # 1 ms delay per row

    return f"{x[1]}+{x[0]}" # simply concatenate the values from the two columns in our subset

now = time.perf_counter()
res = ds.apply(add_one_slow, subset=["SMILES", "QSPRID"], axis=1)
print(time.perf_counter() - now)

Parallel apply in progress for parkinsons.:   0%|          | 0/5 [00:00<?, ?it/s]

4.914074870001059


In [16]:
res

QSPRID
parkinsons_8214    parkinsons_8214+Cc1ccc2nc(-n3c(O)nc4ccccc43)oc2c1
parkinsons_8033    parkinsons_8033+Cc1cn2cc(C(F)(F)F)cc2c(C#Cc2cc...
parkinsons_6447    parkinsons_6447+c1ccc(-n2cc(-c3ccnc4c3ccc(OCCC...
parkinsons_5313    parkinsons_5313+O=C1c2cc(COc3ccccc3)nn2CCN1c1c...
parkinsons_8121    parkinsons_8121+Cc1nc(C)c(Oc2c(Cl)cc(-c3c(Cl)c...
                                         ...                        
parkinsons_5734            parkinsons_5734+Nc1c(NCCCC(=O)O)c(=O)c1=O
parkinsons_5191    parkinsons_5191+COc1c(Cl)cc2nc(C3N=CC=N3)[nH]c2c1
parkinsons_5390    parkinsons_5390+CN1CCN(Cc2ccc3c(-c4ccc(F)cc4)c...
parkinsons_860     parkinsons_860+CC1(C)CCN(c2ccc(C#Cc3cncc(Cl)c3...
parkinsons_7270         parkinsons_7270+O=C1NCCCC1(c1ccccc1)N1CCCCC1
Length: 8222, dtype: object

Compare this with the time it takes to do the same thing with only one CPU:

In [17]:
ds.nJobs = 1
now = time.perf_counter()
ds.apply(add_one_slow, subset=["SMILES", "QSPRID"], axis=1)
print(time.perf_counter() - now)

9.419596627998544
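
The timings are consistent with the delay: 8222 rows at 1 ms each account for roughly 8.2 s of sleeping on one CPU, and about half of that on two. The sketch below shows the general idea of a chunked parallel apply with the standard library only. It is an illustration under stated assumptions, not QSPRPred's implementation: `chunked_apply` and `chunk_size` are hypothetical names, and a thread pool is used here for simplicity (it helps in this case because `time.sleep` releases the GIL), whereas CPU-bound work would typically need processes.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_concat(row):
    """Emulate a slow per-row function with a 1 ms delay."""
    time.sleep(0.001)
    return f"{row[1]}+{row[0]}"


def apply_chunk(chunk):
    """Apply the slow function to every row in one chunk."""
    return [slow_concat(row) for row in chunk]


def chunked_apply(rows, n_jobs, chunk_size=50):
    """Split rows into chunks and process the chunks with n_jobs workers."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    if n_jobs == 1:
        results = [apply_chunk(c) for c in chunks]
    else:
        with ThreadPoolExecutor(max_workers=n_jobs) as pool:
            # map() yields results in submission order, so row order is preserved.
            results = list(pool.map(apply_chunk, chunks))
    return [item for chunk in results for item in chunk]


# Toy rows standing in for (SMILES, QSPRID) pairs.
rows = [(f"SMILES_{i}", f"toy_{i}") for i in range(200)]

start = time.perf_counter()
serial = chunked_apply(rows, n_jobs=1)
serial_time = time.perf_counter() - start

start = time.perf_counter()
parallel = chunked_apply(rows, n_jobs=2)
parallel_time = time.perf_counter() - start

print(f"serial: {serial_time:.2f} s, parallel: {parallel_time:.2f} s")
```

Both calls return the same values in the same order; only the wall-clock time differs, and the exact speed-up depends on the machine and the chunk size.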


### Saving and Loading

The `PandasDataSet` class also provides a simple way to save and load data sets. You can save the data set and any associated data to a directory. By default, this is the current directory, but you can specify a different one when creating the data set:

In [18]:
ds = PandasDataSet(df=df, store_dir="data", name="parkinsons_pandas", index_cols=["QSPRID"], random_state=random_state)
ds.save()

'data/parkinsons_pandas_df.pkl'

Reloading the data set is easy as well. We just use its name to initialize a new `PandasDataSet` object:

In [19]:
ds = PandasDataSet(name="parkinsons_pandas", store_dir="data")
ds.getDF()

QSPRID,SMILES,GABAAalpha,NMDA,O00222,O15303,P41594,Q13255,Q14416,Q14643,Q14831,Q14832,Q14833,QSPRID
parkinsons_8214,Cc1ccc2nc(-n3c(O)nc4ccccc43)oc2c1,6.52,,,,,,,,,,,parkinsons_8214
parkinsons_8033,Cc1cn2cc(C(F)(F)F)cc2c(C#Cc2ccc(C#N)cc2)n1,,,,,4.80,,,,,,,parkinsons_8033
parkinsons_6447,c1ccc(-n2cc(-c3ccnc4c3ccc(OCCCN3CCCCC3)c4)cn2)cc1,,,,,,,,,,,5.0,parkinsons_6447
parkinsons_5313,O=C1c2cc(COc3ccccc3)nn2CCN1c1c(F)cc(F)cc1,,,,,7.00,,,,,,,parkinsons_5313
parkinsons_8121,Cc1nc(C)c(Oc2c(Cl)cc(-c3c(Cl)c4nnc(CC5CC5)n4cc...,,,,,,,8.377,,,,,parkinsons_8121
...,...,...,...,...,...,...,...,...,...,...,...,...,...
parkinsons_5734,Nc1c(NCCCC(=O)O)c(=O)c1=O,,5.0,,,,,,,,,,parkinsons_5734
parkinsons_5191,COc1c(Cl)cc2nc(C3N=CC=N3)[nH]c2c1,,,,,,,,,,,5.0,parkinsons_5191
parkinsons_5390,CN1CCN(Cc2ccc3c(-c4ccc(F)cc4)cc(C(N)=O)nc3c2)CC1,,,,,,,6.600,,,,,parkinsons_5390
parkinsons_860,CC1(C)CCN(c2ccc(C#Cc3cncc(Cl)c3)cn2)C(=O)O1,,,,,7.57,,,,,,,parkinsons_860
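
The file name returned by `save()` above ends in `_df.pkl`, which suggests that the data frame is stored as a pickle. As a rough sketch of such a round trip in plain pandas, assuming a `<name>_df.pkl` naming pattern and leaving aside any metadata QSPRPred may store alongside it:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame(
    {"SMILES": ["CCO", "c1ccccc1"], "pX": [5.1, 6.3]},
    index=pd.Index(["toy_0", "toy_1"], name="QSPRID"),
)

# A temporary directory stands in for the store_dir used in the tutorial.
with tempfile.TemporaryDirectory() as store_dir:
    name = "toy_pandas"
    path = os.path.join(store_dir, f"{name}_df.pkl")
    toy.to_pickle(path)                # write the frame, index included
    restored = pd.read_pickle(path)    # read it back by reconstructing the path

# The restored frame has the same values, dtypes and index as the original.
print(restored.equals(toy))
```

Because the path can be rebuilt from the directory and the name alone, those two pieces of information are enough to locate the file again, which matches how the data set is reloaded above.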


## Data Representation (`MoleculeTable`)

The next extension of the `PandasDataSet` class is the `MoleculeTable` class. While `PandasDataSet` is a completely general class that can be used for any data set, `MoleculeTable` is specifically designed for data sets that contain molecular structures. It is a subclass of `PandasDataSet` and adds some useful functions for molecules. For example, you can easily convert the SMILES strings to RDKit molecules in your table:

In [20]:
from qsprpred.data.data import MoleculeTable

mt = MoleculeTable(df=df, store_dir="data", name="parkinsons", add_rdkit=True)
mt.getDF()

QSPRID,SMILES,GABAAalpha,NMDA,O00222,O15303,P41594,Q13255,Q14416,Q14643,Q14831,Q14832,Q14833,QSPRID,RDMol
parkinsons_0,Cc1ccc2nc(-n3c(O)nc4ccccc43)oc2c1,6.52,,,,,,,,,,,parkinsons_0,
parkinsons_1,Cc1cn2cc(C(F)(F)F)cc2c(C#Cc2ccc(C#N)cc2)n1,,,,,4.80,,,,,,,parkinsons_1,
parkinsons_2,c1ccc(-n2cc(-c3ccnc4c3ccc(OCCCN3CCCCC3)c4)cn2)cc1,,,,,,,,,,,5.0,parkinsons_2,
parkinsons_3,O=C1c2cc(COc3ccccc3)nn2CCN1c1c(F)cc(F)cc1,,,,,7.00,,,,,,,parkinsons_3,
parkinsons_4,Cc1nc(C)c(Oc2c(Cl)cc(-c3c(Cl)c4nnc(CC5CC5)n4cc...,,,,,,,8.377,,,,,parkinsons_4,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
parkinsons_8217,Nc1c(NCCCC(=O)O)c(=O)c1=O,,5.0,,,,,,,,,,parkinsons_8217,
parkinsons_8218,COc1c(Cl)cc2nc(C3N=CC=N3)[nH]c2c1,,,,,,,,,,,5.0,parkinsons_8218,
parkinsons_8219,CN1CCN(Cc2ccc3c(-c4ccc(F)cc4)cc(C(N)=O)nc3c2)CC1,,,,,,,6.600,,,,,parkinsons_8219,
parkinsons_8220,CC1(C)CCN(c2ccc(C#Cc3cncc(Cl)c3)cn2)C(=O)O1,,,,,7.57,,,,,,,parkinsons_8220,


Besides making it possible to depict structures in Jupyter notebooks, the `MoleculeTable` class also lets you calculate molecular descriptors and fingerprints, determine scaffolds, and standardize structures. We will go over some of these features in the next sections.

### Dropping Invalid Molecules and Standardizing Structures

Before making calculations, it is a good idea to standardize structures and drop invalid molecules. `MoleculeTable` provides a simple way to do this:

In [21]:
mt.standardizeSmiles('chembl', drop_invalid=True)

The code above uses the ChEMBL standardizer to standardize the structures and drops all invalid molecules. You can also do it separately, though:

In [22]:
mt.standardizeSmiles('chembl', drop_invalid=False)
# dropInvalids() returns a boolean array of the same length as the unfiltered
# data frame: True marks rows that were dropped, False marks rows that were kept
mt.dropInvalids()

QSPRID
parkinsons_0       False
parkinsons_1       False
parkinsons_2       False
parkinsons_3       False
parkinsons_4       False
                   ...  
parkinsons_8217    False
parkinsons_8218    False
parkinsons_8219    False
parkinsons_8220    False
parkinsons_8221    False
Name: SMILES, Length: 8222, dtype: bool

### Calculating Molecular Descriptors

QSPRPred provides an interface to easily calculate molecular descriptors. The package already contains many descriptor implementations, but you can also easily add your own. Here is an example that calculates Morgan fingerprints and RDKit descriptors:

In [23]:
from qsprpred.data.utils.descriptorsets import RDKitDescs, FingerprintSet
from qsprpred.data.utils.descriptorcalculator import MoleculeDescriptorsCalculator

calc = MoleculeDescriptorsCalculator(desc_sets=[FingerprintSet("MorganFP", radius=3, nBits=2048), RDKitDescs()])
mt.addDescriptors(calc)

**Note:** You can also speed these calculations up with the `n_jobs`/`nJobs` parameter/attribute since descriptor calculation is implemented to process multiple sets of molecules in parallel if `mt.nJobs` is higher than 1.

Descriptors are saved in their own wrapped tables, which can be accessed with the `descriptors` attribute:

In [24]:
mt.descriptors

[<qsprpred.data.data.DescriptorTable at 0x7f6fd8f83fd0>]

In [25]:
mt.descriptors[0].getDF().shape

(8222, 2257)

Adding more descriptors later (e.g. [protein descriptors](data_preparation_advanced.ipynb)) will append to this list. You can easily get the whole matrix of descriptors as follows:

In [26]:
mt.getDescriptors()

QSPRID,Descriptor_FingerprintSet_MorganFP_0,Descriptor_FingerprintSet_MorganFP_1,Descriptor_FingerprintSet_MorganFP_2,Descriptor_FingerprintSet_MorganFP_3,Descriptor_FingerprintSet_MorganFP_4,Descriptor_FingerprintSet_MorganFP_5,Descriptor_FingerprintSet_MorganFP_6,Descriptor_FingerprintSet_MorganFP_7,Descriptor_FingerprintSet_MorganFP_8,Descriptor_FingerprintSet_MorganFP_9,...,Descriptor_RDkit_fr_sulfide,Descriptor_RDkit_fr_sulfonamd,Descriptor_RDkit_fr_sulfone,Descriptor_RDkit_fr_term_acetylene,Descriptor_RDkit_fr_tetrazole,Descriptor_RDkit_fr_thiazole,Descriptor_RDkit_fr_thiocyan,Descriptor_RDkit_fr_thiophene,Descriptor_RDkit_fr_unbrch_alkane,Descriptor_RDkit_fr_urea
parkinsons_0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
parkinsons_3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
parkinsons_8217,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
parkinsons_8218,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_8219,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_8220,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
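
Conceptually, a list of separate descriptor tables can be combined into one matrix by aligning them on the shared "QSPRID" index. The sketch below shows that idea in plain pandas; it is an assumption about the general approach, not a description of what `getDescriptors()` does internally, and the table and column names (`set_a`, `set_b`, `A_0`, `B_0`) are made up.

```python
import pandas as pd

ids = pd.Index(["toy_0", "toy_1", "toy_2"], name="QSPRID")

# Two descriptor tables for the same molecules.
set_a = pd.DataFrame({"A_0": [0.0, 1.0, 0.0], "A_1": [1.0, 0.0, 0.0]}, index=ids)
# Deliberately stored in reverse row order to show that alignment is by label.
set_b = pd.DataFrame({"B_0": [0.3, 0.1, 0.2]}, index=ids[::-1])

# Descriptor tables kept in a list; adding a set later appends to it.
descriptor_tables = [set_a]
descriptor_tables.append(set_b)

# Concatenating along the columns aligns rows on the index labels, not on position.
matrix = pd.concat(descriptor_tables, axis=1)
print(matrix.shape)
print(matrix.loc["toy_0", "B_0"])
```

Aligning on the index rather than on row position is what makes a stable identifier column useful: the tables can be computed, stored, and shuffled independently and still be joined correctly.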


## Data Representation (`QSPRDataset`)

The `QSPRDataset` class is the next extension of the `PandasDataSet` and `MoleculeTable` classes. It is a subclass of `MoleculeTable` and adds some useful functions for QSPR model training itself. You can create a `QSPRDataset` object from scratch as usual, but this time you will need to specify the target properties and tasks you would like to model. For example, a data set for a simple regression task would be defined as follows:

In [27]:
from qsprpred.models.tasks import TargetTasks
from qsprpred.data.data import QSPRDataset, TargetProperty

ds_qspr = QSPRDataset(df=df, store_dir="data", name="parkinsons", target_props=[{"name": "GABAAalpha", "task": TargetTasks.REGRESSION}], random_state=random_state)
# or
ds_qspr = QSPRDataset(df=df, store_dir="data", name="parkinsons", target_props=[TargetProperty("GABAAalpha", TargetTasks.REGRESSION)], random_state=random_state)
ds_qspr.getDF()

QSPRID,SMILES,GABAAalpha,NMDA,O00222,O15303,P41594,Q13255,Q14416,Q14643,Q14831,Q14832,Q14833,QSPRID,RDMol
parkinsons_0,Cc1ccc2nc(-n3c(O)nc4ccccc43)oc2c1,6.52000,,,,,,,,,,,parkinsons_0,
parkinsons_1,CCOC(=O)c1cc2c(cn1)sc1ccccc12,5.77000,,,,,,,,,,,parkinsons_1,
parkinsons_2,CCOC(=O)c1cn(-c2ccc(Cl)cc2Cl)c(-c2ccccc2)n1,5.75000,,,,,,,,,,,parkinsons_2,
parkinsons_3,c1ccc(-c2nnc3c4ccccc4c(OCC4CCCCC4)nn23)cc1,6.00000,,,,,,,,,,,parkinsons_3,
parkinsons_4,CCOC(=O)c1c(C)nc2sc3c(c2c1N)CCC1(C3)OCCO1,4.79000,,,,,,,,,,,parkinsons_4,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
parkinsons_2045,O=[N+]([O-])c1ccc2[nH]c3cnc([N+](=O)[O-])cc3c2c1,6.44275,,,,,,,,,,,parkinsons_2045,
parkinsons_2046,O=C(O)Cc1c[nH]cn1,3.90700,,,,,,,,,,,parkinsons_2046,
parkinsons_2047,COc1ccccc1COC(=O)c1cnn2c1nnc1ccc(Cl)cc12,8.77000,,,,,,,,,,,parkinsons_2047,
parkinsons_2048,Cn1ncnc1COc1nn2c(-c3ccccc3F)nncc2c1-c1ccc(F)cc1,8.20000,,,,,,,,,,,parkinsons_2048,


You can see that some rows from the original data frame were dropped automatically because they did not have a value for the specified `GABAAalpha` target property.
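
In plain pandas, this kind of filtering corresponds to dropping the rows in which the target column is missing. The following sketch illustrates that on a toy frame with made-up values; whether QSPRPred uses `dropna` internally is an assumption and is not confirmed by this tutorial.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "SMILES": ["CCO", "c1ccccc1", "CC(=O)O", "CCN"],
        "GABAAalpha": [6.52, None, 5.77, None],
        "NMDA": [None, 5.64, None, None],
    }
)

target = "GABAAalpha"
# Keep only the rows that have a value for the chosen target property;
# missing values in other columns (here NMDA) do not matter.
kept = toy.dropna(subset=[target])
print(f"{len(toy)} rows before, {len(kept)} rows with a '{target}' value")
```

Restricting `subset` to the target column matters for a sparse, pivoted table like this one, where most compounds have a measurement for only one target.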

You may also notice that in this data set we are now missing our descriptors:

In [28]:
ds_qspr.descriptors

[]

In [29]:
ds_qspr.getDescriptors()

parkinsons_0
parkinsons_1
parkinsons_2
parkinsons_3
parkinsons_4
...
parkinsons_2045
parkinsons_2046
parkinsons_2047
parkinsons_2048
parkinsons_2049


This is because the `QSPRDataset` class does not know anything about the descriptors in our `MoleculeTable` object, since it was created from the original data frame, which only contains the molecules and target properties. For this reason, `QSPRDataset` provides the `fromMolTable` method, which creates a `QSPRDataset` object from a `MoleculeTable` object while maintaining all data associated with it:

In [30]:
ds_qspr = QSPRDataset.fromMolTable(mt, target_props=[TargetProperty("GABAAalpha", TargetTasks.REGRESSION)])
ds_qspr.descriptors

[<qsprpred.data.data.DescriptorTable at 0x7f6fd8f83fd0>]

In [31]:
ds_qspr.getDescriptors()

QSPRID,Descriptor_FingerprintSet_MorganFP_0,Descriptor_FingerprintSet_MorganFP_1,Descriptor_FingerprintSet_MorganFP_2,Descriptor_FingerprintSet_MorganFP_3,Descriptor_FingerprintSet_MorganFP_4,Descriptor_FingerprintSet_MorganFP_5,Descriptor_FingerprintSet_MorganFP_6,Descriptor_FingerprintSet_MorganFP_7,Descriptor_FingerprintSet_MorganFP_8,Descriptor_FingerprintSet_MorganFP_9,...,Descriptor_RDkit_fr_sulfide,Descriptor_RDkit_fr_sulfonamd,Descriptor_RDkit_fr_sulfone,Descriptor_RDkit_fr_term_acetylene,Descriptor_RDkit_fr_tetrazole,Descriptor_RDkit_fr_thiazole,Descriptor_RDkit_fr_thiocyan,Descriptor_RDkit_fr_thiophene,Descriptor_RDkit_fr_unbrch_alkane,Descriptor_RDkit_fr_urea
parkinsons_0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
parkinsons_3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
parkinsons_2045,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_2046,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_2047,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_2048,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


At this point, this data set is ready to be used for machine learning, which will be addressed in the [following tutorial](tutorial_training.ipynb). More advanced features like interfacing with the [Papyrus data set](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00672-x), calculating protein descriptors or adding your own descriptors are covered in the [Advanced Data Preparation](data_preparation_advanced.ipynb) notebook.