# Data Representation

In this tutorial, you will learn how datasets are represented in QSPRpred.
This will help you to understand how to use the datasets in the library.

## Data Representation (`PandasDataTable`)

QSPRpred uses wrapped `pandas.DataFrame` objects with additional functions that support tasks relevant to QSPR modeling. The `PandasDataTable` class is the base class of all data sets in QSPRpred. Wrapping a `pandas.DataFrame` is easy:

In [1]:
import pandas as pd

df = pd.read_csv('../../tutorial_data/A2A_LIGANDS.tsv', sep='\t')

df.head()

Unnamed: 0,SMILES,pchembl_value_Mean,Year
0,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,8.68,2008.0
1,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,4.82,2010.0
2,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0
3,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,5.45,2009.0
4,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,5.2,2019.0


In [2]:
from qsprpred.data.tables.pandas import PandasDataTable
import os

random_state = 42
os.makedirs("../../tutorial_output/data", exist_ok=True)
dataset = PandasDataTable(df=df, store_dir="../../tutorial_output/data", name="RepresentationTutorialDataset", random_state=random_state)
dataset.getDF()

Unnamed: 0_level_0,SMILES,pchembl_value_Mean,Year,QSPRID
QSPRID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
RepresentationTutorialDataset_0000,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,8.68,2008.0,RepresentationTutorialDataset_0000
RepresentationTutorialDataset_0001,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,4.82,2010.0,RepresentationTutorialDataset_0001
RepresentationTutorialDataset_0002,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0,RepresentationTutorialDataset_0002
RepresentationTutorialDataset_0003,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,5.45,2009.0,RepresentationTutorialDataset_0003
RepresentationTutorialDataset_0004,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,5.20,2019.0,RepresentationTutorialDataset_0004
...,...,...,...,...
RepresentationTutorialDataset_4077,CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12,7.09,2018.0,RepresentationTutorialDataset_4077
RepresentationTutorialDataset_4078,Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1,8.22,2008.0,RepresentationTutorialDataset_4078
RepresentationTutorialDataset_4079,Nc1nc(Nc2ccc(F)cc2)nc(CSc2nnc(N)s2)n1,4.89,2010.0,RepresentationTutorialDataset_4079
RepresentationTutorialDataset_4080,CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1,6.51,2013.0,RepresentationTutorialDataset_4080


You can query this data set directly for simple information like the number of samples:

In [3]:
len(dataset)

4082

or the saved properties/features:

In [4]:
dataset.getProperties()

Index(['SMILES', 'pchembl_value_Mean', 'Year', 'QSPRID'], dtype='object')

You can also do some operations on the data frame, like shuffle it:

In [5]:
dataset.shuffle()
dataset.getDF()

Unnamed: 0_level_0,SMILES,pchembl_value_Mean,Year,QSPRID
QSPRID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
RepresentationTutorialDataset_0599,CCCn1c(-c2ccccc2)nc2c1ncnc2NC1CCOC1,5.77,2018.0,RepresentationTutorialDataset_0599
RepresentationTutorialDataset_0752,CCCn1c(=O)c2c([nH]c(-c3c[nH]nc3)n2)n(CCC)c1=O,6.64,2006.0,RepresentationTutorialDataset_0752
RepresentationTutorialDataset_1954,COc1cccc2c1nc(N)n1nc(CN3CCN(c4ncc(F)cc4)CC3C)nc21,7.88,2015.0,RepresentationTutorialDataset_1954
RepresentationTutorialDataset_2928,COc1cccc(CCCC(=O)Nc2nc3c(cccc3)c(=O)s2)c1,6.94,2013.0,RepresentationTutorialDataset_2928
RepresentationTutorialDataset_2512,COc1c2nc(NC(=O)c3ccc(F)cc3)sc2c(N(CCO)C(C)=O)cc1,7.01,2010.0,RepresentationTutorialDataset_2512
...,...,...,...,...
RepresentationTutorialDataset_1130,CCNC(=O)C1OC(n2cnc3c2nc(C#CC2(O)CCCC2)nc3NCC)C...,6.03,2006.0,RepresentationTutorialDataset_1130
RepresentationTutorialDataset_1294,CNC(=O)C1SC(n2cnc3c2nc(Cl)nc3NCc2cc(I)ccc2)C(O...,6.65,2003.0,RepresentationTutorialDataset_1294
RepresentationTutorialDataset_0860,CCNC(=O)C1OC(n2cnc3c(N)nc(N4CCN(c5ccc(OCC(=O)O...,7.28,2015.0,RepresentationTutorialDataset_0860
RepresentationTutorialDataset_3507,CNC(=O)C1[Se]C(n2cnc3c2ncnc3NC2CCC2)C(O)C1O,5.97,2017.0,RepresentationTutorialDataset_3507


or fetch, drop and add some columns:

In [6]:
# get
year = dataset.getProperty("Year")
display(year)
# drop
dataset.removeProperty("Year")
display(dataset.getProperties())
# set
dataset.addProperty("Year", year)
display(dataset.getProperties())

QSPRID
RepresentationTutorialDataset_0599    2018.0
RepresentationTutorialDataset_0752    2006.0
RepresentationTutorialDataset_1954    2015.0
RepresentationTutorialDataset_2928    2013.0
RepresentationTutorialDataset_2512    2010.0
                                       ...  
RepresentationTutorialDataset_1130    2006.0
RepresentationTutorialDataset_1294    2003.0
RepresentationTutorialDataset_0860    2015.0
RepresentationTutorialDataset_3507    2017.0
RepresentationTutorialDataset_3174    1998.0
Name: Year, Length: 4082, dtype: float64

Index(['SMILES', 'pchembl_value_Mean', 'QSPRID'], dtype='object')

Index(['SMILES', 'pchembl_value_Mean', 'QSPRID', 'Year'], dtype='object')

However, you can always access the underlying data frame if more complex operations are needed:

In [7]:
df = dataset.getDF()
df

Unnamed: 0_level_0,SMILES,pchembl_value_Mean,QSPRID,Year
QSPRID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
RepresentationTutorialDataset_0599,CCCn1c(-c2ccccc2)nc2c1ncnc2NC1CCOC1,5.77,RepresentationTutorialDataset_0599,2018.0
RepresentationTutorialDataset_0752,CCCn1c(=O)c2c([nH]c(-c3c[nH]nc3)n2)n(CCC)c1=O,6.64,RepresentationTutorialDataset_0752,2006.0
RepresentationTutorialDataset_1954,COc1cccc2c1nc(N)n1nc(CN3CCN(c4ncc(F)cc4)CC3C)nc21,7.88,RepresentationTutorialDataset_1954,2015.0
RepresentationTutorialDataset_2928,COc1cccc(CCCC(=O)Nc2nc3c(cccc3)c(=O)s2)c1,6.94,RepresentationTutorialDataset_2928,2013.0
RepresentationTutorialDataset_2512,COc1c2nc(NC(=O)c3ccc(F)cc3)sc2c(N(CCO)C(C)=O)cc1,7.01,RepresentationTutorialDataset_2512,2010.0
...,...,...,...,...
RepresentationTutorialDataset_1130,CCNC(=O)C1OC(n2cnc3c2nc(C#CC2(O)CCCC2)nc3NCC)C...,6.03,RepresentationTutorialDataset_1130,2006.0
RepresentationTutorialDataset_1294,CNC(=O)C1SC(n2cnc3c2nc(Cl)nc3NCc2cc(I)ccc2)C(O...,6.65,RepresentationTutorialDataset_1294,2003.0
RepresentationTutorialDataset_0860,CCNC(=O)C1OC(n2cnc3c(N)nc(N4CCN(c5ccc(OCC(=O)O...,7.28,RepresentationTutorialDataset_0860,2015.0
RepresentationTutorialDataset_3507,CNC(=O)C1[Se]C(n2cnc3c2ncnc3NC2CCC2)C(O)C1O,5.97,RepresentationTutorialDataset_3507,2017.0


It is always possible to wrap it again:

In [8]:
dataset = PandasDataTable(df=df, store_dir="../../tutorial_output/data/", name="RepresentationTutorialDataset", random_state=random_state)
dataset

<qsprpred.data.data.PandasDataTable at 0x7fceb4654e50>

In [9]:
dataset.getDF()

Unnamed: 0_level_0,SMILES,pchembl_value_Mean,QSPRID,Year
QSPRID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
RepresentationTutorialDataset_0000,CCCn1c(-c2ccccc2)nc2c1ncnc2NC1CCOC1,5.77,RepresentationTutorialDataset_0000,2018.0
RepresentationTutorialDataset_0001,CCCn1c(=O)c2c([nH]c(-c3c[nH]nc3)n2)n(CCC)c1=O,6.64,RepresentationTutorialDataset_0001,2006.0
RepresentationTutorialDataset_0002,COc1cccc2c1nc(N)n1nc(CN3CCN(c4ncc(F)cc4)CC3C)nc21,7.88,RepresentationTutorialDataset_0002,2015.0
RepresentationTutorialDataset_0003,COc1cccc(CCCC(=O)Nc2nc3c(cccc3)c(=O)s2)c1,6.94,RepresentationTutorialDataset_0003,2013.0
RepresentationTutorialDataset_0004,COc1c2nc(NC(=O)c3ccc(F)cc3)sc2c(N(CCO)C(C)=O)cc1,7.01,RepresentationTutorialDataset_0004,2010.0
...,...,...,...,...
RepresentationTutorialDataset_4077,CCNC(=O)C1OC(n2cnc3c2nc(C#CC2(O)CCCC2)nc3NCC)C...,6.03,RepresentationTutorialDataset_4077,2006.0
RepresentationTutorialDataset_4078,CNC(=O)C1SC(n2cnc3c2nc(Cl)nc3NCc2cc(I)ccc2)C(O...,6.65,RepresentationTutorialDataset_4078,2003.0
RepresentationTutorialDataset_4079,CCNC(=O)C1OC(n2cnc3c(N)nc(N4CCN(c5ccc(OCC(=O)O...,7.28,RepresentationTutorialDataset_4079,2015.0
RepresentationTutorialDataset_4080,CNC(=O)C1[Se]C(n2cnc3c2ncnc3NC2CCC2)C(O)C1O,5.97,RepresentationTutorialDataset_4080,2017.0


### Data Indexing

You might have noticed that when the data set was recreated from a new data frame, the "QSPRID" column was reset, while the data remained in the same order as after shuffling. This is because the index is automatically reset when a new `PandasDataTable` object is created. However, you can set the index to a specific column when creating the data set:

In [10]:
dataset.shuffle()
df = dataset.getDF()
df

Unnamed: 0_level_0,SMILES,pchembl_value_Mean,QSPRID,Year
QSPRID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
RepresentationTutorialDataset_0599,Cc1ccc(Nc2nc3c(cccc3)c3c2nc(C2CCCC2)[nH]3)cc1,5.28,RepresentationTutorialDataset_0599,2006.0
RepresentationTutorialDataset_0752,CC(=O)Nc1ccc(Cn2nnc3c2nc(N)nc3-c2ccco2)cc1,6.94,RepresentationTutorialDataset_0752,2009.0
RepresentationTutorialDataset_1954,CCNC(=O)C1OC(n2cnc3c(NCC)nc(C#CC(O)C4CCCCC4)nc...,7.23,RepresentationTutorialDataset_1954,2006.0
RepresentationTutorialDataset_2928,OCC1OC(n2cnc3c2ncnc3NC2CCSC2)C(O)C1O,5.01,RepresentationTutorialDataset_2928,2007.0
RepresentationTutorialDataset_2512,Cc1cc(-c2nc3c(ncn3C3OC(Cn4nc(C)cc4C)C(O)C3O)c(...,5.37,RepresentationTutorialDataset_2512,2018.0
...,...,...,...,...
RepresentationTutorialDataset_1130,CC(C)n1cnc(CCNc2nc3c(ncn3C3CC(NC(=O)Cc4ccccc4)...,5.78,RepresentationTutorialDataset_1130,2010.0
RepresentationTutorialDataset_1294,O=C(NC1CCC1)C1SC(n2cnc3c2nc(Cl)nc3NCc2cc(I)ccc...,6.26,RepresentationTutorialDataset_1294,2006.0
RepresentationTutorialDataset_0860,COCCNC(=O)c1cc2c(oc1=N)c(OC)ccc2,6.30,RepresentationTutorialDataset_0860,2012.0
RepresentationTutorialDataset_3507,Nc1nc(-c2ccco2)c2ncn(CCc3ccccc3)c2n1,6.57,RepresentationTutorialDataset_3507,2005.0


In [11]:
dataset = PandasDataTable(df=df, store_dir="../../tutorial_output/data", name="RepresentationTutorialDataset", index_cols=["QSPRID"], random_state=random_state)
dataset.getDF()

Unnamed: 0_level_0,SMILES,pchembl_value_Mean,QSPRID,Year
QSPRID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
RepresentationTutorialDataset_0599,Cc1ccc(Nc2nc3c(cccc3)c3c2nc(C2CCCC2)[nH]3)cc1,5.28,RepresentationTutorialDataset_0599,2006.0
RepresentationTutorialDataset_0752,CC(=O)Nc1ccc(Cn2nnc3c2nc(N)nc3-c2ccco2)cc1,6.94,RepresentationTutorialDataset_0752,2009.0
RepresentationTutorialDataset_1954,CCNC(=O)C1OC(n2cnc3c(NCC)nc(C#CC(O)C4CCCCC4)nc...,7.23,RepresentationTutorialDataset_1954,2006.0
RepresentationTutorialDataset_2928,OCC1OC(n2cnc3c2ncnc3NC2CCSC2)C(O)C1O,5.01,RepresentationTutorialDataset_2928,2007.0
RepresentationTutorialDataset_2512,Cc1cc(-c2nc3c(ncn3C3OC(Cn4nc(C)cc4C)C(O)C3O)c(...,5.37,RepresentationTutorialDataset_2512,2018.0
...,...,...,...,...
RepresentationTutorialDataset_1130,CC(C)n1cnc(CCNc2nc3c(ncn3C3CC(NC(=O)Cc4ccccc4)...,5.78,RepresentationTutorialDataset_1130,2010.0
RepresentationTutorialDataset_1294,O=C(NC1CCC1)C1SC(n2cnc3c2nc(Cl)nc3NCc2cc(I)ccc...,6.26,RepresentationTutorialDataset_1294,2006.0
RepresentationTutorialDataset_0860,COCCNC(=O)c1cc2c(oc1=N)c(OC)ccc2,6.30,RepresentationTutorialDataset_0860,2012.0
RepresentationTutorialDataset_3507,Nc1nc(-c2ccco2)c2ncn(CCc3ccccc3)c2n1,6.57,RepresentationTutorialDataset_3507,2005.0
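
The index behaviour shown above can be imitated with plain pandas. The sketch below is not QSPRpred's implementation: `make_ids` and `wrap` are hypothetical helpers, and the zero-padding rule (width taken from the row count) is an assumption that merely reproduces IDs of the form `Name_0000`. It illustrates why re-wrapping a shuffled frame yields fresh IDs in the shuffled order, while pointing at an existing ID column keeps the original IDs.

```python
import pandas as pd


def make_ids(name, n_rows):
    """Build IDs such as 'Demo_0003': the table name plus a zero-padded row position."""
    width = len(str(n_rows))  # assumption: padding width follows the row count
    return [f"{name}_{i:0{width}d}" for i in range(n_rows)]


def wrap(df, name, index_cols=None):
    """Hypothetical stand-in for creating a table; NOT the QSPRpred constructor."""
    df = df.copy()
    if index_cols is None:
        # no index requested: generate fresh IDs that follow the current row order
        df["QSPRID"] = make_ids(name, len(df))
        index_cols = ["QSPRID"]
    # keep the index columns available as regular columns as well
    return df.set_index(index_cols, drop=False)


raw = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1", "CC(=O)O"], "pX": [5.1, 6.2, 7.3]})
first = wrap(raw, "Demo")
shuffled = first.sample(frac=1.0, random_state=42)
plain = shuffled.reset_index(drop=True)  # same row order, IDs only in the column

rewrapped = wrap(plain, "Demo")                        # IDs are regenerated
preserved = wrap(plain, "Demo", index_cols=["QSPRID"])  # existing IDs are kept

print(rewrapped.index.tolist())
print(preserved.index.tolist())
```

In both cases the rows stay in their shuffled order; only the labels attached to them differ.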


Being aware of the index will help you track down the compounds and associated data further down the line. You can also reset the index to a custom column, or even use multiple columns as the index:

In [12]:
dataset.setIndex(["SMILES", "QSPRID"])
dataset.getDF().index

MultiIndex([(                                      'Cc1ccc(Nc2nc3c(cccc3)c3c2nc(C2CCCC2)[nH]3)cc1', ...),
            (                                         'CC(=O)Nc1ccc(Cn2nnc3c2nc(N)nc3-c2ccco2)cc1', ...),
            (                           'CCNC(=O)C1OC(n2cnc3c(NCC)nc(C#CC(O)C4CCCCC4)nc32)C(O)C1O', ...),
            (                                               'OCC1OC(n2cnc3c2ncnc3NC2CCSC2)C(O)C1O', ...),
            (          'Cc1cc(-c2nc3c(ncn3C3OC(Cn4nc(C)cc4C)C(O)C3O)c(Nc3c(F)cc(Cl)cc3)n2)cc(C)c1', ...),
            (                               'CCNC(=O)C1OC(n2cnc3c2nc(Cl)nc3NNC(=O)c2cccs2)C(O)C1O', ...),
            (                                 'CCC(=O)Nc1nc(-c2ccc3OCOc3c2)nc(-c2cc3c(cc2)OCO3)c1', ...),
            (                                      'COc1c2cccc(CNC(=O)c3nc(N)nc4c(F)cccc34)c2ncc1', ...),
            (                                   'COc1cc(-c2cc(NC(C)=O)nc(-n3nc(C)cc3C)n2)cc(OC)c1', ...),
            (                                 

### Parallelization

The `PandasDataTable` class also provides a simple way to parallelize operations on the data frame. For example, you can easily apply a function to all rows of the data frame in parallel. All you need to do is initialize the `PandasDataTable` object with the number of CPUs you want to use:

In [13]:
dataset = PandasDataTable(df=df, store_dir="../../tutorial_output/data/", name="RepresentationTutorialDataset", index_cols=["QSPRID"], n_jobs=2, random_state=random_state)
df = dataset.getDF()
df

Unnamed: 0_level_0,SMILES,pchembl_value_Mean,QSPRID,Year
QSPRID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
RepresentationTutorialDataset_0599,Cc1ccc(Nc2nc3c(cccc3)c3c2nc(C2CCCC2)[nH]3)cc1,5.28,RepresentationTutorialDataset_0599,2006.0
RepresentationTutorialDataset_0752,CC(=O)Nc1ccc(Cn2nnc3c2nc(N)nc3-c2ccco2)cc1,6.94,RepresentationTutorialDataset_0752,2009.0
RepresentationTutorialDataset_1954,CCNC(=O)C1OC(n2cnc3c(NCC)nc(C#CC(O)C4CCCCC4)nc...,7.23,RepresentationTutorialDataset_1954,2006.0
RepresentationTutorialDataset_2928,OCC1OC(n2cnc3c2ncnc3NC2CCSC2)C(O)C1O,5.01,RepresentationTutorialDataset_2928,2007.0
RepresentationTutorialDataset_2512,Cc1cc(-c2nc3c(ncn3C3OC(Cn4nc(C)cc4C)C(O)C3O)c(...,5.37,RepresentationTutorialDataset_2512,2018.0
...,...,...,...,...
RepresentationTutorialDataset_1130,CC(C)n1cnc(CCNc2nc3c(ncn3C3CC(NC(=O)Cc4ccccc4)...,5.78,RepresentationTutorialDataset_1130,2010.0
RepresentationTutorialDataset_1294,O=C(NC1CCC1)C1SC(n2cnc3c2nc(Cl)nc3NCc2cc(I)ccc...,6.26,RepresentationTutorialDataset_1294,2006.0
RepresentationTutorialDataset_0860,COCCNC(=O)c1cc2c(oc1=N)c(OC)ccc2,6.30,RepresentationTutorialDataset_0860,2012.0
RepresentationTutorialDataset_3507,Nc1nc(-c2ccco2)c2ncn(CCc3ccccc3)c2n1,6.57,RepresentationTutorialDataset_3507,2005.0


From now on, every operation that supports parallelization will run on the two requested CPUs. One example is the `apply` method, which applies a function row by row over a subset of columns of the data frame:

In [14]:
# note: this example does not work on Windows
import time

def add_one_slow(x):
    """Emulate a slow function."""

    time.sleep(0.001)  # 1 ms delay

    return f"{x[1]}+{x[0]}" # simply concatenate the values from the two columns in our subset

now = time.perf_counter()
res = dataset.apply(add_one_slow, subset=["SMILES", "QSPRID"], axis=1)
print(time.perf_counter() - now)

2.429115056991577


In [15]:
res

QSPRID
RepresentationTutorialDataset_0599    RepresentationTutorialDataset_0599+Cc1ccc(Nc2n...
RepresentationTutorialDataset_0752    RepresentationTutorialDataset_0752+CC(=O)Nc1cc...
RepresentationTutorialDataset_1954    RepresentationTutorialDataset_1954+CCNC(=O)C1O...
RepresentationTutorialDataset_2928    RepresentationTutorialDataset_2928+OCC1OC(n2cn...
RepresentationTutorialDataset_2512    RepresentationTutorialDataset_2512+Cc1cc(-c2nc...
                                                            ...                        
RepresentationTutorialDataset_1130    RepresentationTutorialDataset_1130+CC(C)n1cnc(...
RepresentationTutorialDataset_1294    RepresentationTutorialDataset_1294+O=C(NC1CCC1...
RepresentationTutorialDataset_0860    RepresentationTutorialDataset_0860+COCCNC(=O)c...
RepresentationTutorialDataset_3507    RepresentationTutorialDataset_3507+Nc1nc(-c2cc...
RepresentationTutorialDataset_3174    RepresentationTutorialDataset_3174+Cn1cc2c(n1)...
Length: 4082, dtype: object

Compare this with the time it takes to do the same thing with only one CPU:

In [16]:
dataset.nJobs = 1
now = time.perf_counter()
dataset.apply(add_one_slow, subset=["SMILES", "QSPRID"], axis=1)
print(time.perf_counter() - now)

4.639329973841086
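
The timing difference above comes from splitting the rows into chunks and processing the chunks concurrently. The following is a minimal sketch of that idea using only the standard library. It deliberately uses a thread pool instead of worker processes, so it is not how QSPRpred parallelizes `apply`; the helper names (`slow_concat`, `chunked`, `parallel_apply`) are made up for this illustration. Threads are enough here because `time.sleep` releases the interpreter lock, and they also run on Windows.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_concat(row):
    """Emulate a slow row-wise function (1 ms delay per row)."""
    time.sleep(0.001)
    smiles, qsprid = row
    return f"{qsprid}+{smiles}"


def chunked(items, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal parts."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_apply(func, rows, n_jobs=1):
    """Apply func to every row, processing the chunks in n_jobs worker threads."""
    chunks = chunked(rows, n_jobs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # pool.map preserves chunk order, so results line up with the input rows
        results = list(pool.map(lambda chunk: [func(r) for r in chunk], chunks))
    return [value for chunk_result in results for value in chunk_result]


# toy rows: (SMILES, ID) pairs for a series of linear alkanes
rows = [("C" * (i + 1), f"Demo_{i:03d}") for i in range(200)]

start = time.perf_counter()
serial = parallel_apply(slow_concat, rows, n_jobs=1)
print("1 worker: ", time.perf_counter() - start)

start = time.perf_counter()
threaded = parallel_apply(slow_concat, rows, n_jobs=4)
print("4 workers:", time.perf_counter() - start)
```

For CPU-bound functions, threads would not give this speed-up in standard CPython; worker processes are the usual choice there, at the cost of having to pickle the function and the data.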


### Saving and Loading

The `PandasDataTable` class also provides a simple way to save and load data sets. You can save the data set and any associated data to a directory. By default this is the current directory, but you can specify a different directory upon creation of the data set:

In [17]:
dataset = PandasDataTable(df=df, store_dir="../../tutorial_output/data/", name="RepresentationTutorialDataset", index_cols=["QSPRID"], random_state=random_state)
dataset.save()

'/zfsdata/data/helle/01_MainProjects/03_QSPRPred/Scripts/QSPRpred/tutorials/tutorial_output/data/RepresentationTutorialDataset/RepresentationTutorialDataset_meta.json'

Reloading the data set is easy as well. We just use its name to initialize a new `PandasDataTable` object:

In [18]:
dataset = PandasDataTable(name="RepresentationTutorialDataset", store_dir="../../tutorial_output/data/")
dataset.getDF()

Unnamed: 0_level_0,SMILES,pchembl_value_Mean,QSPRID,Year
QSPRID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
RepresentationTutorialDataset_0599,Cc1ccc(Nc2nc3c(cccc3)c3c2nc(C2CCCC2)[nH]3)cc1,5.28,RepresentationTutorialDataset_0599,2006.0
RepresentationTutorialDataset_0752,CC(=O)Nc1ccc(Cn2nnc3c2nc(N)nc3-c2ccco2)cc1,6.94,RepresentationTutorialDataset_0752,2009.0
RepresentationTutorialDataset_1954,CCNC(=O)C1OC(n2cnc3c(NCC)nc(C#CC(O)C4CCCCC4)nc...,7.23,RepresentationTutorialDataset_1954,2006.0
RepresentationTutorialDataset_2928,OCC1OC(n2cnc3c2ncnc3NC2CCSC2)C(O)C1O,5.01,RepresentationTutorialDataset_2928,2007.0
RepresentationTutorialDataset_2512,Cc1cc(-c2nc3c(ncn3C3OC(Cn4nc(C)cc4C)C(O)C3O)c(...,5.37,RepresentationTutorialDataset_2512,2018.0
...,...,...,...,...
RepresentationTutorialDataset_1130,CC(C)n1cnc(CCNc2nc3c(ncn3C3CC(NC(=O)Cc4ccccc4)...,5.78,RepresentationTutorialDataset_1130,2010.0
RepresentationTutorialDataset_1294,O=C(NC1CCC1)C1SC(n2cnc3c2nc(Cl)nc3NCc2cc(I)ccc...,6.26,RepresentationTutorialDataset_1294,2006.0
RepresentationTutorialDataset_0860,COCCNC(=O)c1cc2c(oc1=N)c(OC)ccc2,6.30,RepresentationTutorialDataset_0860,2012.0
RepresentationTutorialDataset_3507,Nc1nc(-c2ccco2)c2ncn(CCc3ccccc3)c2n1,6.57,RepresentationTutorialDataset_3507,2005.0
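
The save and reload cycle above relies on a metadata file stored under `<store_dir>/<name>/<name>_meta.json`, as the returned path shows. The sketch below imitates that layout with pandas and the standard library. It is only an assumption about the general pattern: `save_table`, `load_table`, the pickle file name and the metadata keys are hypothetical and do not describe what QSPRpred actually writes.

```python
import json
import os
import tempfile

import pandas as pd


def save_table(df, name, store_dir):
    """Write the frame plus a small metadata file; return the metadata path."""
    folder = os.path.join(store_dir, name)
    os.makedirs(folder, exist_ok=True)
    data_path = os.path.join(folder, f"{name}_df.pkl")  # hypothetical file name
    df.to_pickle(data_path)
    meta_path = os.path.join(folder, f"{name}_meta.json")
    with open(meta_path, "w") as handle:
        # hypothetical metadata: just enough to find the data file again
        json.dump({"name": name, "data_file": os.path.basename(data_path)}, handle)
    return os.path.abspath(meta_path)


def load_table(name, store_dir):
    """Locate the metadata by table name and read the frame it points to."""
    folder = os.path.join(store_dir, name)
    with open(os.path.join(folder, f"{name}_meta.json")) as handle:
        meta = json.load(handle)
    return pd.read_pickle(os.path.join(folder, meta["data_file"]))


with tempfile.TemporaryDirectory() as store_dir:
    table = pd.DataFrame(
        {"SMILES": ["CCO", "CCN"], "Year": [2008.0, 2010.0]},
        index=pd.Index(["Demo_0", "Demo_1"], name="QSPRID"),
    )
    meta_file = save_table(table, "Demo", store_dir)
    restored = load_table("Demo", store_dir)

print(meta_file)
print(restored)
```

Because the name and the storage directory are enough to find the metadata, a table can be restored without passing the data frame again.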


## Data Representation (`MoleculeTable`)

The next extension of the `PandasDataTable` class is the `MoleculeTable` class. While `PandasDataTable` is a completely general class that can be used for any data set, `MoleculeTable` is specifically designed for data sets that contain molecular structures. It is a subclass of `PandasDataTable` and adds some useful functions for molecules. For example, you can easily convert the SMILES strings to RDKit molecules in your table:

In [19]:
from qsprpred.data import MoleculeTable

mt = MoleculeTable(df=df, store_dir="../../tutorial_output/data/", name="RepresentationTutorialDataset", add_rdkit=True, overwrite=True, random_state=random_state)
mt.getDF()

Unnamed: 0_level_0,SMILES,pchembl_value_Mean,QSPRID,Year,RDMol
QSPRID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
RepresentationTutorialDataset_0000,Cc1ccc(Nc2nc3c(cccc3)c3c2nc(C2CCCC2)[nH]3)cc1,5.28,RepresentationTutorialDataset_0000,2006.0,
RepresentationTutorialDataset_0001,CC(=O)Nc1ccc(Cn2nnc3c2nc(N)nc3-c2ccco2)cc1,6.94,RepresentationTutorialDataset_0001,2009.0,
RepresentationTutorialDataset_0002,CCNC(=O)C1OC(n2cnc3c(NCC)nc(C#CC(O)C4CCCCC4)nc...,7.23,RepresentationTutorialDataset_0002,2006.0,
RepresentationTutorialDataset_0003,OCC1OC(n2cnc3c2ncnc3NC2CCSC2)C(O)C1O,5.01,RepresentationTutorialDataset_0003,2007.0,
RepresentationTutorialDataset_0004,Cc1cc(-c2nc3c(ncn3C3OC(Cn4nc(C)cc4C)C(O)C3O)c(...,5.37,RepresentationTutorialDataset_0004,2018.0,
...,...,...,...,...,...
RepresentationTutorialDataset_4077,CC(C)n1cnc(CCNc2nc3c(ncn3C3CC(NC(=O)Cc4ccccc4)...,5.78,RepresentationTutorialDataset_4077,2010.0,
RepresentationTutorialDataset_4078,O=C(NC1CCC1)C1SC(n2cnc3c2nc(Cl)nc3NCc2cc(I)ccc...,6.26,RepresentationTutorialDataset_4078,2006.0,
RepresentationTutorialDataset_4079,COCCNC(=O)c1cc2c(oc1=N)c(OC)ccc2,6.30,RepresentationTutorialDataset_4079,2012.0,
RepresentationTutorialDataset_4080,Nc1nc(-c2ccco2)c2ncn(CCc3ccccc3)c2n1,6.57,RepresentationTutorialDataset_4080,2005.0,


In addition to depicting structures in Jupyter notebooks, this allows you to use the `MoleculeTable` class to calculate molecular descriptors and fingerprints, determine scaffolds, and standardize structures. We will go over some of these features in the next sections.

### Dropping Invalid Molecules and Standardizing Structures

Before making calculations, it is a good idea to standardize structures and drop invalid molecules. `MoleculeTable` provides a simple way to do this:

In [20]:
mt.standardizeSmiles('chembl', drop_invalid=True)

The code above uses the ChEMBL standardizer to standardize the structures and drops all invalid molecules. You can also do it separately, though:

In [21]:
mt.standardizeSmiles('chembl', drop_invalid=False)
mt.dropInvalids() # returns a boolean array of the same length as the unfiltered data frame that indicates which rows were dropped (True) and which were kept (False)

QSPRID
RepresentationTutorialDataset_0000    False
RepresentationTutorialDataset_0001    False
RepresentationTutorialDataset_0002    False
RepresentationTutorialDataset_0003    False
RepresentationTutorialDataset_0004    False
                                      ...  
RepresentationTutorialDataset_4077    False
RepresentationTutorialDataset_4078    False
RepresentationTutorialDataset_4079    False
RepresentationTutorialDataset_4080    False
RepresentationTutorialDataset_4081    False
Name: SMILES, Length: 4082, dtype: bool

### Calculating Molecular Descriptors

QSPRpred provides an interface to easily calculate molecular descriptors. The package already contains many descriptor implementations, but you can also easily add your own. See the [descriptor tutorial](descriptors.ipynb) for more information on this topic. Here is an example that calculates Morgan fingerprints and RDKit descriptors:

In [22]:
from qsprpred.data.descriptors.sets import FingerprintSet, RDKitDescs
from qsprpred.data.descriptors.calculators import MoleculeDescriptorsCalculator

calc = MoleculeDescriptorsCalculator(desc_sets=[FingerprintSet("MorganFP", radius=3, nBits=2048), RDKitDescs()])
mt.addDescriptors(calc)

**Note:** You can also speed these calculations up with the `n_jobs`/`nJobs` parameter/attribute since descriptor calculation is implemented to process multiple sets of molecules in parallel if `mt.nJobs` is higher than 1.

Descriptors are kept in their own wrapped tables, which can be accessed with the `descriptors` attribute. On saving the data set, the descriptors are also saved to the same directory.

In [23]:
mt.descriptors

[<qsprpred.data.data.DescriptorTable at 0x7fce6f6b8b10>]

In [24]:
mt.descriptors[0].getDF().shape

(4082, 2259)

Adding more descriptors (e.g. [protein descriptors](data_preparation_advanced.ipynb)) later will append to this list. You can easily get the whole matrix of descriptors as follows:

In [25]:
mt.getDescriptors()

Unnamed: 0_level_0,Descriptor_FingerprintSet_MorganFP_0,Descriptor_FingerprintSet_MorganFP_1,Descriptor_FingerprintSet_MorganFP_2,Descriptor_FingerprintSet_MorganFP_3,Descriptor_FingerprintSet_MorganFP_4,Descriptor_FingerprintSet_MorganFP_5,Descriptor_FingerprintSet_MorganFP_6,Descriptor_FingerprintSet_MorganFP_7,Descriptor_FingerprintSet_MorganFP_8,Descriptor_FingerprintSet_MorganFP_9,...,Descriptor_RDkit_fr_sulfonamd,Descriptor_RDkit_fr_sulfone,Descriptor_RDkit_fr_term_acetylene,Descriptor_RDkit_fr_tetrazole,Descriptor_RDkit_fr_thiazole,Descriptor_RDkit_fr_thiocyan,Descriptor_RDkit_fr_thiophene,Descriptor_RDkit_fr_unbrch_alkane,Descriptor_RDkit_fr_urea,Descriptor_RDkit_qed
QSPRID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
RepresentationTutorialDataset_0000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.493001
RepresentationTutorialDataset_0001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.577514
RepresentationTutorialDataset_0002,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.373546
RepresentationTutorialDataset_0003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.567398
RepresentationTutorialDataset_0004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.263458
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
RepresentationTutorialDataset_4077,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.106870
RepresentationTutorialDataset_4078,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.251558
RepresentationTutorialDataset_4079,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.806020
RepresentationTutorialDataset_4080,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.626544
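
The relationship between the list of descriptor tables and the combined matrix can be pictured with plain pandas. This is a sketch under the assumption that each descriptor table shares the data set's index and that the full matrix is their column-wise concatenation; the variable names and the toy values are illustrative only and are not taken from QSPRpred.

```python
import pandas as pd

# every descriptor table is indexed by the same compound IDs
index = pd.Index(["Demo_0", "Demo_1", "Demo_2"], name="QSPRID")

fingerprint_table = pd.DataFrame(
    {"FP_0": [0.0, 1.0, 0.0], "FP_1": [1.0, 0.0, 0.0]}, index=index
)
property_table = pd.DataFrame({"MolWt": [46.07, 78.11, 60.05]}, index=index)

# descriptor sets calculated later are appended to the list ...
descriptor_tables = [fingerprint_table]
descriptor_tables.append(property_table)

# ... and the full matrix is the tables placed side by side, aligned on the index
matrix = pd.concat(descriptor_tables, axis=1)
print(matrix)
```

Keeping the tables separate makes it possible to add or replace one descriptor set without recalculating the others.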


## Data Representation (`QSPRDataset`)

The `QSPRDataset` class is the next extension of the `PandasDataTable` and `MoleculeTable` classes. It is a subclass of `MoleculeTable` and adds functions for QSPR model training itself. The data preparation steps covered for the `PandasDataTable` and `MoleculeTable` classes also apply to the `QSPRDataset` class. You can create a `QSPRDataset` object from scratch as usual, but this time you need to specify the target properties and tasks you would like to model. For example, a data set for a simple regression task would be defined as follows:

In [26]:
from qsprpred.data import QSPRDataset
from qsprpred import TargetTasks, TargetProperty

dataset_qspr = QSPRDataset(df=df, store_dir="../../tutorial_output/data/", name="RepresentationTutorialDataset", target_props=[{"name": "pchembl_value_Mean", "task": "REGRESSION"}], random_state=random_state)
# or
dataset_qspr = QSPRDataset(df=df, store_dir="../../tutorial_output/data/", name="RepresentationTutorialDataset", target_props=[TargetProperty("pchembl_value_Mean", TargetTasks.REGRESSION)], random_state=random_state)
dataset_qspr.getDF()

Unnamed: 0_level_0,SMILES,pchembl_value_Mean,QSPRID,Year,RDMol
QSPRID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
RepresentationTutorialDataset_0000,Cc1ccc(Nc2nc3ccccc3c3[nH]c(C4CCCC4)nc23)cc1,5.28,RepresentationTutorialDataset_0000,2006.0,
RepresentationTutorialDataset_0001,CC(=O)Nc1ccc(Cn2nnc3c(-c4ccco4)nc(N)nc32)cc1,6.94,RepresentationTutorialDataset_0001,2009.0,
RepresentationTutorialDataset_0002,CCNC(=O)C1OC(n2cnc3c(NCC)nc(C#CC(O)C4CCCCC4)nc...,7.23,RepresentationTutorialDataset_0002,2006.0,
RepresentationTutorialDataset_0003,OCC1OC(n2cnc3c(NC4CCSC4)ncnc32)C(O)C1O,5.01,RepresentationTutorialDataset_0003,2007.0,
RepresentationTutorialDataset_0004,Cc1cc(C)cc(-c2nc(Nc3ccc(Cl)cc3F)c3ncn(C4OC(Cn5...,5.37,RepresentationTutorialDataset_0004,2018.0,
...,...,...,...,...,...
RepresentationTutorialDataset_4077,CC(C)n1cnc(CCNc2nc(NCC(c3ccccc3)c3ccccc3)c3ncn...,5.78,RepresentationTutorialDataset_4077,2010.0,
RepresentationTutorialDataset_4078,O=C(NC1CCC1)C1SC(n2cnc3c(NCc4cccc(I)c4)nc(Cl)n...,6.26,RepresentationTutorialDataset_4078,2006.0,
RepresentationTutorialDataset_4079,COCCNC(=O)c1cc2cccc(OC)c2oc1=N,6.30,RepresentationTutorialDataset_4079,2012.0,
RepresentationTutorialDataset_4080,Nc1nc(-c2ccco2)c2ncn(CCc3ccccc3)c2n1,6.57,RepresentationTutorialDataset_4080,2005.0,


Rows of the original data frame that have no value for the specified `pchembl_value_Mean` target property are dropped automatically when the data set is created.
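
The effect of dropping rows without a target value can be reproduced with plain pandas. The snippet is only an illustration with a toy data frame and `DataFrame.dropna`; it is an assumption about the observable result, not a description of how `QSPRDataset` implements it.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "SMILES": ["CCO", "CCN", "CCC", "c1ccccc1"],
        "pchembl_value_Mean": [6.5, None, 7.1, None],  # two compounds lack a value
    }
)

target = "pchembl_value_Mean"
kept = df.dropna(subset=[target])  # keep only rows that have a target value
n_dropped = len(df) - len(kept)

print(f"dropped {n_dropped} of {len(df)} rows")
print(kept)
```

Comparing `len()` of the data set before and after creation is a quick way to check how many compounds were removed for this reason.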

You may also notice that in this data set we are now missing our descriptors:

In [27]:
dataset_qspr.descriptors

[]

In [28]:
dataset_qspr.getDescriptors()

RepresentationTutorialDataset_0000
RepresentationTutorialDataset_0001
RepresentationTutorialDataset_0002
RepresentationTutorialDataset_0003
RepresentationTutorialDataset_0004
...
RepresentationTutorialDataset_4077
RepresentationTutorialDataset_4078
RepresentationTutorialDataset_4079
RepresentationTutorialDataset_4080
RepresentationTutorialDataset_4081


This is because the `QSPRDataset` class does not know anything about the descriptors in our `MoleculeTable` object since it only uses the original data frame with the molecules and target properties. Therefore, there is also the `fromMolTable` method that allows you to create a `QSPRDataset` object from a `MoleculeTable` object while maintaining all data associated with it:

In [29]:
dataset_qspr = QSPRDataset.fromMolTable(mt, target_props=[TargetProperty("pchembl_value_Mean", TargetTasks.REGRESSION)])
dataset_qspr.descriptors

[<qsprpred.data.data.DescriptorTable at 0x7fce6f6b8b10>]

In [30]:
dataset_qspr.getDescriptors()

Unnamed: 0_level_0,Descriptor_FingerprintSet_MorganFP_0,Descriptor_FingerprintSet_MorganFP_1,Descriptor_FingerprintSet_MorganFP_2,Descriptor_FingerprintSet_MorganFP_3,Descriptor_FingerprintSet_MorganFP_4,Descriptor_FingerprintSet_MorganFP_5,Descriptor_FingerprintSet_MorganFP_6,Descriptor_FingerprintSet_MorganFP_7,Descriptor_FingerprintSet_MorganFP_8,Descriptor_FingerprintSet_MorganFP_9,...,Descriptor_RDkit_fr_sulfonamd,Descriptor_RDkit_fr_sulfone,Descriptor_RDkit_fr_term_acetylene,Descriptor_RDkit_fr_tetrazole,Descriptor_RDkit_fr_thiazole,Descriptor_RDkit_fr_thiocyan,Descriptor_RDkit_fr_thiophene,Descriptor_RDkit_fr_unbrch_alkane,Descriptor_RDkit_fr_urea,Descriptor_RDkit_qed
QSPRID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
RepresentationTutorialDataset_0000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.493001
RepresentationTutorialDataset_0001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.577514
RepresentationTutorialDataset_0002,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.373546
RepresentationTutorialDataset_0003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.567398
RepresentationTutorialDataset_0004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.263458
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
RepresentationTutorialDataset_4077,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.106870
RepresentationTutorialDataset_4078,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.251558
RepresentationTutorialDataset_4079,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.806020
RepresentationTutorialDataset_4080,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.626544


Now you know how data sets are represented in QSPRpred. Before you start modelling, you should also check out the [data preparation tutorial](data_preparation.ipynb) to learn how to prepare your data sets for modelling. This tutorial covers additional preparation steps such as data filtering, feature selection and standardization through the `QSPRDataset.prepareDataset` method.