# Data Preparation

In this tutorial, you will learn how to prepare data sets with QSPRPred.

## Data Representation (`PandasDataSet`)

QSPRPred wraps `pandas.DataFrame` objects and adds functions on top that support tasks relevant for QSPR modeling. The `PandasDataSet` class is the base class of all data sets in QSPRPred. Wrapping a `pandas.DataFrame` is easy:

In [1]:
# load a sample data set on Parkinson's disease
import pandas as pd

df = pd.read_table('data/parkinsons_pivot.tsv')
df

,SMILES,GABAAalpha,NMDA,P41594,Q13255,Q14643
0,Brc1cc(-c2nc(-c3ncccc3)no2)ccc1,,,6.93,,
1,Brc1ccc2c(c1)-c1ncnn1Cc1c(-c3cccs3)ncn-21,8.400000,,,,
2,Brc1ccc2c(c1)-c1nncn1Cc1c(I)ncn-21,8.110000,,,,
3,Brc1cccc(-c2cc(-c3ccccc3)nnc2)c1,8.013333,,,,
4,Brc1cccc(-c2cnnc(NCc3ccccc3)c2)c1,7.505000,,,,
...,...,...,...,...,...,...
6220,c1cnc(COc2nn3c(C4CCC4)nnc3c3c2cccc3)cc1,7.700000,,,,
6221,c1cnc(COc2nn3c(N4CCOCC4)nnc3c3c2C2CCC3CC2)cc1,6.000000,,,,
6222,c1cnc(N2CCc3c(cccc3)C2)cc1,,6.12,,,
6223,c1cnc(Oc2cccc(-n3nnc(-c4ncccc4)n3)c2)cc1,,,5.54,,


In [2]:
from qsprpred.data.data import PandasDataSet

ds = PandasDataSet(df=df, store_dir="data", name="parkinsons")
ds

<qsprpred.data.data.PandasDataSet at 0x7fbb92cf90c0>

You can query this data set directly for simple information like the number of samples:

In [3]:
len(ds)

6225

or the saved properties/features:

In [4]:
ds.getProperties()

Index(['SMILES', 'GABAAalpha', 'NMDA', 'P41594', 'Q13255', 'Q14643', 'QSPRID'], dtype='object')

You can also perform some operations on the data frame, such as shuffling it:

In [5]:
ds.shuffle()

or dropping some columns:

In [6]:
ds.removeProperty("Q14643")

However, you can always access the underlying data frame if more complex operations are needed:

In [7]:
df = ds.getDF()
df

QSPRID,SMILES,GABAAalpha,NMDA,P41594,Q13255,QSPRID
parkinsons_5192,O=C(Nc1nccc(Oc2cccnc2)n1)c1cc(Cl)ccc1,,,5.000,,parkinsons_5192
parkinsons_1574,CN(C)c1c2c(ncc1)sc1c2ncn(CC2CCCC2)c1=O,,,4.000,7.58,parkinsons_1574
parkinsons_96,C#Cc1ncc(C#Cc2csc(C)n2)cc1,,,5.530,,parkinsons_96
parkinsons_3538,Cc1cn2cc(C(=O)N3CC4CC3CN4C)cc2c(C#Cc2cscc2)n1,,,5.400,,parkinsons_3538
parkinsons_4486,N=C(C=Cc1ccccc1)NCCCc1ccccc1,,6.920000,,,parkinsons_4486
...,...,...,...,...,...,...
parkinsons_2145,COc1cc(CN2C(=O)c3c(cccc3)C2=O)c(NC(=O)c2ccccn2...,,,5.852,,parkinsons_2145
parkinsons_643,CC1(C)CN(c2ccc(C#Cc3ccc(F)c(F)c3)cn2)C(=O)O1,,,7.500,,parkinsons_643
parkinsons_2918,Cc1c(COc2ccc(C(=O)NCC(O)C(F)(F)F)cn2)c(-c2cccc...,8.64,,,,parkinsons_2918
parkinsons_1150,CCNC(=O)c1ccc(OCc2c(C)onc2-c2ccccc2)nc1,8.54,,,,parkinsons_1150


It is always possible to wrap it again:

In [8]:
ds = PandasDataSet(df=df, store_dir="data", name="parkinsons")
ds

<qsprpred.data.data.PandasDataSet at 0x7fbac6facdf0>

In [9]:
ds.getDF()

QSPRID,SMILES,GABAAalpha,NMDA,P41594,Q13255,QSPRID
parkinsons_0,O=C(Nc1nccc(Oc2cccnc2)n1)c1cc(Cl)ccc1,,,5.000,,parkinsons_0
parkinsons_1,CN(C)c1c2c(ncc1)sc1c2ncn(CC2CCCC2)c1=O,,,4.000,7.58,parkinsons_1
parkinsons_2,C#Cc1ncc(C#Cc2csc(C)n2)cc1,,,5.530,,parkinsons_2
parkinsons_3,Cc1cn2cc(C(=O)N3CC4CC3CN4C)cc2c(C#Cc2cscc2)n1,,,5.400,,parkinsons_3
parkinsons_4,N=C(C=Cc1ccccc1)NCCCc1ccccc1,,6.920000,,,parkinsons_4
...,...,...,...,...,...,...
parkinsons_6220,COc1cc(CN2C(=O)c3c(cccc3)C2=O)c(NC(=O)c2ccccn2...,,,5.852,,parkinsons_6220
parkinsons_6221,CC1(C)CN(c2ccc(C#Cc3ccc(F)c(F)c3)cn2)C(=O)O1,,,7.500,,parkinsons_6221
parkinsons_6222,Cc1c(COc2ccc(C(=O)NCC(O)C(F)(F)F)cn2)c(-c2cccc...,8.64,,,,parkinsons_6222
parkinsons_6223,CCNC(=O)c1ccc(OCc2c(C)onc2-c2ccccc2)nc1,8.54,,,,parkinsons_6223


### Data Indexing

You might have noticed that when the data set was recreated from the new data frame, the "QSPRID" column was reset, while the rows stayed in the same order as after shuffling. This is because the index is automatically reset when a new `PandasDataSet` object is created. However, you can set the index to a specific column when creating the data set:

In [10]:
ds.shuffle()
df = ds.getDF()
df

QSPRID,SMILES,GABAAalpha,NMDA,P41594,Q13255,QSPRID
parkinsons_5856,O=C1NCCc2c1ccc(C#Cc1ccccc1F)c2,,,6.82,,parkinsons_5856
parkinsons_3991,CC(C)(C)c1cn(C2=NCC(=O)N3CCc4c(cccc4C4CCCO4)C3...,,,6.29,5.0,parkinsons_3991
parkinsons_5104,COCc1c2c(cnc1N=C=S)[nH]c1c2cc(OCc2ccccc2)cc1,6.742000,,,,parkinsons_5104
parkinsons_4228,C#Cc1ccc2c(c1)C(=O)N(C)Cc1c(C(=O)OC(C)(C)C)ncn-21,6.891333,,,,parkinsons_4228
parkinsons_332,NC1=Nc2c(cccc2)-c2c(cccc2)N1,,5.00,,,parkinsons_332
...,...,...,...,...,...,...
parkinsons_5659,O=C1NCN(c2ccccc2)C12CCN(C1Cc3c4c1cccc4ccc3)CC2,6.500000,,,,parkinsons_5659
parkinsons_4013,O=C(Nc1cc(-c2ccccc2)nn1-c1ccccc1)c1cccs1,,,6.70,,parkinsons_4013
parkinsons_4675,N=C(N)NN=Cc1ccc(O)cc1O,,3.67,,,parkinsons_4675
parkinsons_110,Cc1cccc(S(=O)(=O)N2CC=C(C#Cc3ncccc3)CC2)c1,,,8.09,,parkinsons_110


In [11]:
ds = PandasDataSet(df=df, store_dir="data", name="parkinsons", index_cols=["QSPRID"])
ds.getDF()

QSPRID,SMILES,GABAAalpha,NMDA,P41594,Q13255,QSPRID
parkinsons_5856,O=C1NCCc2c1ccc(C#Cc1ccccc1F)c2,,,6.82,,parkinsons_5856
parkinsons_3991,CC(C)(C)c1cn(C2=NCC(=O)N3CCc4c(cccc4C4CCCO4)C3...,,,6.29,5.0,parkinsons_3991
parkinsons_5104,COCc1c2c(cnc1N=C=S)[nH]c1c2cc(OCc2ccccc2)cc1,6.742000,,,,parkinsons_5104
parkinsons_4228,C#Cc1ccc2c(c1)C(=O)N(C)Cc1c(C(=O)OC(C)(C)C)ncn-21,6.891333,,,,parkinsons_4228
parkinsons_332,NC1=Nc2c(cccc2)-c2c(cccc2)N1,,5.00,,,parkinsons_332
...,...,...,...,...,...,...
parkinsons_5659,O=C1NCN(c2ccccc2)C12CCN(C1Cc3c4c1cccc4ccc3)CC2,6.500000,,,,parkinsons_5659
parkinsons_4013,O=C(Nc1cc(-c2ccccc2)nn1-c1ccccc1)c1cccs1,,,6.70,,parkinsons_4013
parkinsons_4675,N=C(N)NN=Cc1ccc(O)cc1O,,3.67,,,parkinsons_4675
parkinsons_110,Cc1cccc(S(=O)(=O)N2CC=C(C#Cc3ncccc3)CC2)c1,,,8.09,,parkinsons_110
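
The difference between regenerating identifiers and reusing an existing column as the index can be sketched with plain pandas. This is only an illustration of the behavior seen above, not QSPRPred's implementation; the helper `with_generated_ids` and the toy SMILES are assumptions made for the example.

```python
import pandas as pd

# Toy data standing in for the real data set (hypothetical values).
df = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1", "CC(=O)O", "CN"]})


def with_generated_ids(frame, name):
    """Return a copy indexed by fresh '<name>_<row number>' identifiers."""
    out = frame.reset_index(drop=True).copy()
    out["QSPRID"] = [f"{name}_{i}" for i in range(len(out))]
    return out.set_index("QSPRID", drop=False)


wrapped = with_generated_ids(df, "demo")

# Shuffling changes the row order, but each identifier travels with its row.
shuffled = wrapped.sample(frac=1, random_state=42)

# Wrapping again with generated IDs: row order is kept, identifiers restart at 0.
rewrapped = with_generated_ids(shuffled, "demo")

# Reusing the existing "QSPRID" column as index keeps the original identifiers.
reindexed = shuffled.reset_index(drop=True).set_index("QSPRID", drop=False)

print(rewrapped.index.tolist())
print(reindexed.index.tolist())
```

The second variant corresponds to passing `index_cols=["QSPRID"]` in the cell above: the identifiers assigned before shuffling remain attached to their compounds.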


Being aware of the index will help you track down compounds and their associated data further down the line. You can also reset the index to a custom column at any time:

In [12]:
ds.setIndex(["SMILES"])
ds.getDF().index

Index(['O=C1NCCc2c1ccc(C#Cc1ccccc1F)c2',
       'CC(C)(C)c1cn(C2=NCC(=O)N3CCc4c(cccc4C4CCCO4)C3=C2)cn1',
       'COCc1c2c(cnc1N=C=S)[nH]c1c2cc(OCc2ccccc2)cc1',
       'C#Cc1ccc2c(c1)C(=O)N(C)Cc1c(C(=O)OC(C)(C)C)ncn-21',
       'NC1=Nc2c(cccc2)-c2c(cccc2)N1',
       'O=S1OCC2C(CO1)C1(Cl)C(Cl)=C(Cl)C2(Cl)C1(Cl)Cl',
       'CC(Oc1nnc(-c2cc(=O)n(C)cc2)n1C)c1noc(-c2cc(Cl)ccc2)n1',
       'O=C1NC(c2cc(C#CC3CC(F)(F)C3)cnc2)C(c2cccc(Cl)c2)O1',
       'Cc1c(COc2ccc(C(=O)N3CCS(=O)(=O)CC3)cn2)c(-c2cccc(F)c2)no1',
       'NC(C(=O)O)C12CC(C(=O)O)(C1)C2',
       ...
       'Oc1ccc(CC(O)CN2CCC(O)(Cc3ccc(Br)cc3)CC2)cc1',
       'c1ccc(-c2nnc3c4c(c(Oc5cnccc5)nn23)C2CCC4CC2)cc1',
       'Cc1cn2cc(C(F)(F)F)cc2c(C#Cc2cscc2)n1',
       'Cc1c2ncn(-c3cc4c(cc3)[nH]cc4CC3CCCN3C)c2ncc1',
       'COc1ccc(C(=O)N(C)c2nc(-c3ccncc3)cs2)cc1',
       'O=C1NCN(c2ccccc2)C12CCN(C1Cc3c4c1cccc4ccc3)CC2',
       'O=C(Nc1cc(-c2ccccc2)nn1-c1ccccc1)c1cccs1', 'N=C(N)NN=Cc1ccc(O)cc1O',
       'Cc1cccc(S(=O)(=O)N2CC=C(C#Cc3ncccc3

or even use multiple columns as index:

In [13]:
ds.setIndex(["SMILES", "QSPRID"])
ds.getDF().index

MultiIndex([(                           'O=C1NCCc2c1ccc(C#Cc1ccccc1F)c2', ...),
            (    'CC(C)(C)c1cn(C2=NCC(=O)N3CCc4c(cccc4C4CCCO4)C3=C2)cn1', ...),
            (             'COCc1c2c(cnc1N=C=S)[nH]c1c2cc(OCc2ccccc2)cc1', ...),
            (        'C#Cc1ccc2c(c1)C(=O)N(C)Cc1c(C(=O)OC(C)(C)C)ncn-21', ...),
            (                             'NC1=Nc2c(cccc2)-c2c(cccc2)N1', ...),
            (            'O=S1OCC2C(CO1)C1(Cl)C(Cl)=C(Cl)C2(Cl)C1(Cl)Cl', ...),
            (    'CC(Oc1nnc(-c2cc(=O)n(C)cc2)n1C)c1noc(-c2cc(Cl)ccc2)n1', ...),
            (       'O=C1NC(c2cc(C#CC3CC(F)(F)C3)cnc2)C(c2cccc(Cl)c2)O1', ...),
            ('Cc1c(COc2ccc(C(=O)N3CCS(=O)(=O)CC3)cn2)c(-c2cccc(F)c2)no1', ...),
            (                            'NC(C(=O)O)C12CC(C(=O)O)(C1)C2', ...),
            ...
            (              'Oc1ccc(CC(O)CN2CCC(O)(Cc3ccc(Br)cc3)CC2)cc1', ...),
            (          'c1ccc(-c2nnc3c4c(c(Oc5cnccc5)nn23)C2CCC4CC2)cc1', ...),
            (           
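
For readers less familiar with pandas indexing, the following sketch shows what single-column and multi-column indices look like on a plain `pandas.DataFrame` and how rows are then looked up. It uses `DataFrame.set_index` directly and toy values; whether `setIndex` is implemented this way internally is an assumption.

```python
import pandas as pd

# Toy data (hypothetical values) with the same column roles as above.
df = pd.DataFrame(
    {
        "SMILES": ["CCO", "CN"],
        "QSPRID": ["demo_0", "demo_1"],
        "NMDA": [5.0, 6.1],
    }
)

# Single-column index: rows are addressed by one label.
single = df.set_index("SMILES", drop=False)
print(single.loc["CCO", "NMDA"])

# Multi-column index: rows are addressed by a tuple with one label per level.
multi = df.set_index(["SMILES", "QSPRID"], drop=False)
row = multi.loc[("CN", "demo_1")]
print(type(multi.index).__name__, list(multi.index.names))
print(row["NMDA"])
```

Keeping the index columns in the frame (`drop=False`) mirrors the outputs above, where "QSPRID" appears both as the index and as a regular column.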

### Parallelization

The `PandasDataSet` class also provides a simple way to parallelize operations on the data frame. For example, you can easily apply a function to all rows of the data frame in parallel. All you need to do is initialize the `PandasDataSet` object with the number of CPUs you want to use:

In [14]:
ds = PandasDataSet(df=df, store_dir="data", name="parkinsons", index_cols=["QSPRID"], n_jobs=2)
ds.getDF()

QSPRID,SMILES,GABAAalpha,NMDA,P41594,Q13255,QSPRID
parkinsons_5856,O=C1NCCc2c1ccc(C#Cc1ccccc1F)c2,,,6.82,,parkinsons_5856
parkinsons_3991,CC(C)(C)c1cn(C2=NCC(=O)N3CCc4c(cccc4C4CCCO4)C3...,,,6.29,5.0,parkinsons_3991
parkinsons_5104,COCc1c2c(cnc1N=C=S)[nH]c1c2cc(OCc2ccccc2)cc1,6.742000,,,,parkinsons_5104
parkinsons_4228,C#Cc1ccc2c(c1)C(=O)N(C)Cc1c(C(=O)OC(C)(C)C)ncn-21,6.891333,,,,parkinsons_4228
parkinsons_332,NC1=Nc2c(cccc2)-c2c(cccc2)N1,,5.00,,,parkinsons_332
...,...,...,...,...,...,...
parkinsons_5659,O=C1NCN(c2ccccc2)C12CCN(C1Cc3c4c1cccc4ccc3)CC2,6.500000,,,,parkinsons_5659
parkinsons_4013,O=C(Nc1cc(-c2ccccc2)nn1-c1ccccc1)c1cccs1,,,6.70,,parkinsons_4013
parkinsons_4675,N=C(N)NN=Cc1ccc(O)cc1O,,3.67,,,parkinsons_4675
parkinsons_110,Cc1cccc(S(=O)(=O)N2CC=C(C#Cc3ncccc3)CC2)c1,,,8.09,,parkinsons_110


Now, every operation that supports parallelization will use the configured number of CPUs. One example is the `apply` method, which applies a function to the data frame, optionally restricted to a subset of columns:

In [15]:
import time

def add_one_slow(x):
    """ Emulate a slow function."""

    time.sleep(0.01)

    return f"{x[1]}+{x[0]}"

now = time.perf_counter()
ds.apply(add_one_slow, subset=["SMILES", "QSPRID"], axis=1)
print(time.perf_counter() - now)

Parallel apply in progress for parkinsons.:   0%|          | 0/4 [00:00<?, ?it/s]

32.864222849000726


Compare this with the time it takes to do the same thing with only one CPU:

In [16]:
ds.nJobs = 1
now = time.perf_counter()
ds.apply(add_one_slow, subset=["SMILES", "QSPRID"], axis=1)
print(time.perf_counter() - now)

63.31759818000137
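
These timings are in line with a simple estimate: 6225 rows at 0.01 s each is roughly 62 s of sleeping on one CPU, and about half of that when the rows are split over two workers. The sketch below illustrates the general idea of chunked parallel application using only the standard library. It is not QSPRPred's implementation: the helper `apply_chunked`, the chunk size, and the use of threads (which is sufficient here because `time.sleep` does not hold the interpreter lock) are all assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_join(row):
    """Emulate a slow per-row function (10 ms per call)."""
    time.sleep(0.01)
    smiles, qsprid = row
    return f"{qsprid}+{smiles}"


def apply_chunked(func, rows, n_jobs=1, chunk_size=10):
    """Apply func to every row, processing chunks of rows concurrently."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

    def run_chunk(chunk):
        return [func(row) for row in chunk]

    if n_jobs == 1:
        results = [run_chunk(chunk) for chunk in chunks]
    else:
        with ThreadPoolExecutor(max_workers=n_jobs) as pool:
            # Executor.map returns results in the order of the input chunks.
            results = list(pool.map(run_chunk, chunks))
    # Flatten the per-chunk results back into one list.
    return [item for chunk in results for item in chunk]


# Toy rows (hypothetical SMILES and identifiers).
rows = [("C" * (i + 1), f"demo_{i}") for i in range(40)]

start = time.perf_counter()
serial = apply_chunked(slow_join, rows, n_jobs=1)
print(f"1 worker:  {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
parallel = apply_chunked(slow_join, rows, n_jobs=2)
print(f"2 workers: {time.perf_counter() - start:.2f} s")
```

For CPU-bound functions, a process pool would usually be the more appropriate choice in Python, since threads only help when the work releases the interpreter lock.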


### Saving and Loading

The `PandasDataSet` class also provides a simple way to save and load data sets. You can save the data set and any associated data to a directory. By default this is the current directory, but you can specify a different directory upon creation of the data set:

In [17]:
ds = PandasDataSet(df=df, store_dir="data", name="parkinsons_pandas", index_cols=["QSPRID"])
ds.save()

'data/parkinsons_pandas_df.pkl'

Reloading the data set is easy as well. We just use its name to initialize a new `PandasDataSet` object:

In [18]:
ds = PandasDataSet(name="parkinsons_pandas", store_dir="data")
ds.getDF()

QSPRID,SMILES,GABAAalpha,NMDA,P41594,Q13255,QSPRID
parkinsons_5856,O=C1NCCc2c1ccc(C#Cc1ccccc1F)c2,,,6.82,,parkinsons_5856
parkinsons_3991,CC(C)(C)c1cn(C2=NCC(=O)N3CCc4c(cccc4C4CCCO4)C3...,,,6.29,5.0,parkinsons_3991
parkinsons_5104,COCc1c2c(cnc1N=C=S)[nH]c1c2cc(OCc2ccccc2)cc1,6.742000,,,,parkinsons_5104
parkinsons_4228,C#Cc1ccc2c(c1)C(=O)N(C)Cc1c(C(=O)OC(C)(C)C)ncn-21,6.891333,,,,parkinsons_4228
parkinsons_332,NC1=Nc2c(cccc2)-c2c(cccc2)N1,,5.00,,,parkinsons_332
...,...,...,...,...,...,...
parkinsons_5659,O=C1NCN(c2ccccc2)C12CCN(C1Cc3c4c1cccc4ccc3)CC2,6.500000,,,,parkinsons_5659
parkinsons_4013,O=C(Nc1cc(-c2ccccc2)nn1-c1ccccc1)c1cccs1,,,6.70,,parkinsons_4013
parkinsons_4675,N=C(N)NN=Cc1ccc(O)cc1O,,3.67,,,parkinsons_4675
parkinsons_110,Cc1cccc(S(=O)(=O)N2CC=C(C#Cc3ncccc3)CC2)c1,,,8.09,,parkinsons_110
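
The returned path ends in `_df.pkl`, which suggests that the data frame is stored as a pickle file named after the data set. A minimal sketch of such a round trip with plain pandas follows; the helpers `save_frame` and `load_frame` are hypothetical, and the actual `save` method may store additional metadata alongside the data frame.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_frame(frame, store_dir, name):
    """Pickle the frame to '<store_dir>/<name>_df.pkl' and return the path."""
    path = Path(store_dir) / f"{name}_df.pkl"
    frame.to_pickle(path)
    return str(path)


def load_frame(store_dir, name):
    """Load a frame previously written by save_frame."""
    return pd.read_pickle(Path(store_dir) / f"{name}_df.pkl")


# Toy data (hypothetical values) indexed by its identifier column.
df = pd.DataFrame(
    {"SMILES": ["CCO", "CN"], "QSPRID": ["demo_0", "demo_1"]}
).set_index("QSPRID", drop=False)

with tempfile.TemporaryDirectory() as store_dir:
    saved_path = save_frame(df, store_dir, "demo")
    restored = load_frame(store_dir, "demo")

print(Path(saved_path).name)
print(restored.equals(df))
```

Because the index is part of the pickled frame, the reloaded data keeps the same identifiers, as in the output of the cell above.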


## Data Representation (`MoleculeTable`)

The next extension of the `PandasDataSet` class is the `MoleculeTable` class. While `PandasDataSet` is a completely general class that can be used for any data set, `MoleculeTable` is specifically designed for data sets that contain molecular structures. It is a subclass of `PandasDataSet` and adds some useful functions for molecules. For example, you can easily convert the SMILES strings to RDKit molecules in your table:

In [19]:
from qsprpred.data.data import MoleculeTable

mt = MoleculeTable(df=df, store_dir="data", name="parkinsons", add_rdkit=True)
mt.getDF()

QSPRID,SMILES,GABAAalpha,NMDA,P41594,Q13255,QSPRID,RDMol
parkinsons_0,O=C1NCCc2c1ccc(C#Cc1ccccc1F)c2,,,6.82,,parkinsons_0,
parkinsons_1,CC(C)(C)c1cn(C2=NCC(=O)N3CCc4c(cccc4C4CCCO4)C3...,,,6.29,5.0,parkinsons_1,
parkinsons_2,COCc1c2c(cnc1N=C=S)[nH]c1c2cc(OCc2ccccc2)cc1,6.742000,,,,parkinsons_2,
parkinsons_3,C#Cc1ccc2c(c1)C(=O)N(C)Cc1c(C(=O)OC(C)(C)C)ncn-21,6.891333,,,,parkinsons_3,
parkinsons_4,NC1=Nc2c(cccc2)-c2c(cccc2)N1,,5.00,,,parkinsons_4,
...,...,...,...,...,...,...,...
parkinsons_6220,O=C1NCN(c2ccccc2)C12CCN(C1Cc3c4c1cccc4ccc3)CC2,6.500000,,,,parkinsons_6220,
parkinsons_6221,O=C(Nc1cc(-c2ccccc2)nn1-c1ccccc1)c1cccs1,,,6.70,,parkinsons_6221,
parkinsons_6222,N=C(N)NN=Cc1ccc(O)cc1O,,3.67,,,parkinsons_6222,
parkinsons_6223,Cc1cccc(S(=O)(=O)N2CC=C(C#Cc3ncccc3)CC2)c1,,,8.09,,parkinsons_6223,


In addition to letting you depict structures in Jupyter notebooks, the `MoleculeTable` class can calculate molecular descriptors and fingerprints, determine scaffolds, and standardize structures. We will go over some of these features in the next sections.

### Dropping Invalid Molecules and Standardizing Structures

Before making calculations, it is a good idea to standardize structures and drop invalid molecules. `MoleculeTable` provides a simple way to do this:

In [20]:
mt.standardizeSmiles('chembl', drop_invalid=True)

The code above uses the ChEMBL standardizer to standardize the structures and drops all invalid molecules. You can also perform the two steps separately:

In [21]:
mt.standardizeSmiles('chembl', drop_invalid=False)
mt.dropInvalids()

QSPRID
parkinsons_0       False
parkinsons_1       False
parkinsons_2       False
parkinsons_3       False
parkinsons_4       False
                   ...  
parkinsons_6220    False
parkinsons_6221    False
parkinsons_6222    False
parkinsons_6223    False
parkinsons_6224    False
Name: SMILES, Length: 6225, dtype: bool

### Calculating Molecular Descriptors

QSPRPred provides an interface to easily calculate molecular descriptors. The package already contains many descriptor implementations, but you can also easily add your own. Here is an example that calculates Morgan fingerprints and RDKit descriptors:

In [22]:
from qsprpred.data.utils.descriptorsets import rdkit_descs, FingerprintSet
from qsprpred.data.utils.descriptorcalculator import MoleculeDescriptorsCalculator

calc = MoleculeDescriptorsCalculator(descsets=[FingerprintSet("MorganFP", radius=3, nBits=2048), rdkit_descs()])
mt.addDescriptors(calc)

**Note:** You can speed up these calculations by setting the `n_jobs` parameter at creation or the `nJobs` attribute afterwards. Descriptor calculation processes multiple sets of molecules in parallel if `mt.nJobs` is higher than 1.

Descriptors are saved in their own wrapped tables, which can be accessed with the `descriptors` attribute:

In [23]:
mt.descriptors

[<qsprpred.data.data.DescriptorTable at 0x7fbb92cf9300>]

In [24]:
mt.descriptors[0].getDF().shape

(6225, 2257)

Adding more descriptors (e.g. protein descriptors) later will append to this list. You can easily get the whole matrix of descriptors as follows:

In [25]:
mt.getDescriptors()

QSPRID,Descriptor_FingerprintSet_MorganFP_0,Descriptor_FingerprintSet_MorganFP_1,Descriptor_FingerprintSet_MorganFP_2,Descriptor_FingerprintSet_MorganFP_3,Descriptor_FingerprintSet_MorganFP_4,Descriptor_FingerprintSet_MorganFP_5,Descriptor_FingerprintSet_MorganFP_6,Descriptor_FingerprintSet_MorganFP_7,Descriptor_FingerprintSet_MorganFP_8,Descriptor_FingerprintSet_MorganFP_9,...,Descriptor_RDkit_fr_sulfide,Descriptor_RDkit_fr_sulfonamd,Descriptor_RDkit_fr_sulfone,Descriptor_RDkit_fr_term_acetylene,Descriptor_RDkit_fr_tetrazole,Descriptor_RDkit_fr_thiazole,Descriptor_RDkit_fr_thiocyan,Descriptor_RDkit_fr_thiophene,Descriptor_RDkit_fr_unbrch_alkane,Descriptor_RDkit_fr_urea
parkinsons_0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
parkinsons_6220,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_6221,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
parkinsons_6222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
parkinsons_6223,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
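
Conceptually, a descriptor matrix like the one above can be assembled by joining several descriptor tables on their shared index. The sketch below shows this with plain pandas and toy values; the table contents and column names are made up for illustration, and how `getDescriptors` combines its tables internally is an assumption.

```python
import pandas as pd

ids = pd.Index(["demo_0", "demo_1", "demo_2"], name="QSPRID")

# Two toy descriptor tables (made-up values) indexed by compound identifier.
table_a = pd.DataFrame(
    [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]],
    index=ids,
    columns=["Descriptor_A_0", "Descriptor_A_1"],
)
# The second table lists the same compounds in a different order.
table_b = pd.DataFrame(
    {"Descriptor_B_0": [3.0, 2.0, 1.0]},
    index=ids[::-1],
)

descriptor_tables = [table_a, table_b]

# Join each further table onto the first one. Rows are aligned by index label,
# so the row order of the individual tables does not matter.
matrix = descriptor_tables[0]
for table in descriptor_tables[1:]:
    matrix = matrix.join(table)

print(matrix)
```

Aligning on the index rather than on row position is one reason why it pays off to keep track of the data set index throughout the workflow.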
