## Data Representation API (`PropertyStorage` and `ChemStore`)

### Overview

When designing the storage API, we tried to identify the most common tasks that arise when working with diverse cheminformatics data sets. The main context is QSPR modelling, but the API can also be used to store data from molecular docking or other structure-based simulations. QSPRpred therefore defines a general API to register and store properties (independent variables) for arbitrary data entries in its `PropertyStorage` abstract class. The `ChemStore` interface extends it with more specific functionality for encoding molecules alongside their properties. The [API documentation](https://cddleiden.github.io/QSPRpred/docs/api/modules.html) of these classes lists the methods and attributes available to interact with them.

Anyone can implement their own storage system for compound representations and their properties. As long as it adheres to the above interfaces, it can be used in QSPRpred seamlessly. This enables more advanced users to interface different storage backends (e.g. SQL databases, NoSQL databases, online REST APIs or stores for prohibitively large data sets) with QSPRpred. Since this is more advanced functionality, it is not yet covered in this tutorial, which focuses only on the currently available implementations that store data locally by means of `pandas` data frames. However, we welcome any inquiries about developing clients for custom APIs or databases. Let us know on the [issue tracker](https://github.com/CDDLeiden/QSPRpred/issues) or via [email](https://github.com/CDDLeiden/QSPRpred/blob/main/pyproject.toml).
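
To give a rough feeling for what "adhering to an interface" means, here is a minimal sketch of a custom backend. Note that `MiniPropertyStorage` and `DictStorage` are hypothetical names invented for this illustration. They are not QSPRpred's `PropertyStorage`, and the method signatures are deliberately simplified; only the method names mirror the ones used later in this tutorial. A real implementation should follow the API documentation linked above.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class MiniPropertyStorage(ABC):
    """Hypothetical, heavily reduced stand-in for a property storage interface.

    This is NOT QSPRpred's `PropertyStorage`; signatures are simplified.
    """

    @abstractmethod
    def getProperties(self) -> List[str]:
        """Return the names of all stored properties."""

    @abstractmethod
    def getProperty(
        self, name: str, ids: Optional[List[str]] = None
    ) -> Dict[str, object]:
        """Return values of one property, optionally only for the given IDs."""

    @abstractmethod
    def addProperty(self, name: str, data: Dict[str, object]) -> None:
        """Add or update a property from an ID -> value mapping."""


class DictStorage(MiniPropertyStorage):
    """Toy backend that keeps everything in nested dictionaries."""

    def __init__(self) -> None:
        # property name -> {entry ID -> value}
        self._props: Dict[str, Dict[str, object]] = {}

    def getProperties(self) -> List[str]:
        return list(self._props)

    def getProperty(self, name, ids=None):
        values = self._props[name]
        if ids is None:
            return dict(values)
        return {entry_id: values[entry_id] for entry_id in ids}

    def addProperty(self, name, data):
        self._props.setdefault(name, {}).update(data)


store = DictStorage()
store.addProperty("Year", {"mol_0": 2008, "mol_1": 2010})
store.addProperty("pchembl_value_Mean", {"mol_0": 8.68, "mol_1": 4.82})
print(store.getProperties())
print(store.getProperty("Year", ids=["mol_1"]))
```

Code written against the abstract class does not need to know whether the data lives in a dictionary, a data frame or a remote database, which is the point of defining the interface first.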

### `PandasDataTable` as `PropertyStorage`

**Note: Feel free to skip this part of the tutorial and continue to the "`TabularStorageBasic` as `ChemStore`" section if you are more interested in the cheminformatics features of QSPRpred and are not interested in understanding `PropertyStorage` in detail.**

Tabular data is the most common data type in QSPR modelling, and `pandas` is the Python package of choice when it comes to processing it. Therefore, we decided to build the default `PropertyStorage` implementation around it and provide a light wrapper for the `pandas.DataFrame` class called `PandasDataTable`. `PandasDataTable` objects simply manage the storage and state of a given `pandas.DataFrame` while giving it all features of the `PropertyStorage` API. You will typically not interact with these objects directly, but we will use one here to demonstrate some of the functions provided by the `PropertyStorage` API. We will use the `A2A_LIGANDS.tsv` file from the tutorial data folder as an example data set. This file contains a list of ligands for the adenosine A2A receptor, which is a common target in drug discovery. The data set contains SMILES strings and some other properties relevant for QSPR modelling:

In [1]:
import pandas as pd

df = pd.read_csv("../../tutorial_data/A2A_LIGANDS.tsv", sep="\t")

df.head()

,SMILES,pchembl_value_Mean,Year
0,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,8.68,2008.0
1,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,4.82,2010.0
2,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0
3,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,5.45,2009.0
4,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,5.2,2019.0


Wrapping this data frame in a `PandasDataTable` object is simple:

In [2]:
from qsprpred.data.tables.pandas import PandasDataTable
import os

random_state = 42  # for reproducibility of all random operations
os.makedirs("../../tutorial_output/data",
            exist_ok=True)  # create the output directory if it does not exist yet
dataset = PandasDataTable(df=df, store_dir="../../tutorial_output/data",
                          name="RepresentationTutorialDataset",
                          random_state=random_state)
dataset.getDF()

ID,SMILES,pchembl_value_Mean,Year,ID
RepresentationTutorialDataset_0000,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,8.68,2008.0,RepresentationTutorialDataset_0000
RepresentationTutorialDataset_0001,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,4.82,2010.0,RepresentationTutorialDataset_0001
RepresentationTutorialDataset_0002,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0,RepresentationTutorialDataset_0002
RepresentationTutorialDataset_0003,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,5.45,2009.0,RepresentationTutorialDataset_0003
RepresentationTutorialDataset_0004,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,5.20,2019.0,RepresentationTutorialDataset_0004
...,...,...,...,...
RepresentationTutorialDataset_4077,CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12,7.09,2018.0,RepresentationTutorialDataset_4077
RepresentationTutorialDataset_4078,Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1,8.22,2008.0,RepresentationTutorialDataset_4078
RepresentationTutorialDataset_4079,Nc1nc(Nc2ccc(F)cc2)nc(CSc2nnc(N)s2)n1,4.89,2010.0,RepresentationTutorialDataset_4079
RepresentationTutorialDataset_4080,CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1,6.51,2013.0,RepresentationTutorialDataset_4080


Since `pandas.DataFrame` is such a popular format, `PropertyStorage` enforces that `getDF` exists in all implementations and should list all data entries and all properties in the `PropertyStorage` object. This is to facilitate easy data exchange between QSPRpred and any custom code that relies on `pandas`. However, we can also do a lot with `PandasDataTable` objects directly:

In [3]:
len(dataset)

4082

We can also list the saved properties / features:

In [4]:
dataset.getProperties()

['SMILES', 'pchembl_value_Mean', 'Year', 'ID']

You will also notice that `PandasDataTable` objects automatically create a unique identifier for each data entry. The name of the property that holds it is available as `idProp`. QSPRpred uses these identifiers internally to keep track of data entries and to select relevant subsets. You can access them as follows:

In [5]:
dataset.idProp

'ID'

In [6]:
dataset.getProperty(dataset.idProp)

ID
RepresentationTutorialDataset_0000    RepresentationTutorialDataset_0000
RepresentationTutorialDataset_0001    RepresentationTutorialDataset_0001
RepresentationTutorialDataset_0002    RepresentationTutorialDataset_0002
RepresentationTutorialDataset_0003    RepresentationTutorialDataset_0003
RepresentationTutorialDataset_0004    RepresentationTutorialDataset_0004
                                                     ...                
RepresentationTutorialDataset_4077    RepresentationTutorialDataset_4077
RepresentationTutorialDataset_4078    RepresentationTutorialDataset_4078
RepresentationTutorialDataset_4079    RepresentationTutorialDataset_4079
RepresentationTutorialDataset_4080    RepresentationTutorialDataset_4080
RepresentationTutorialDataset_4081    RepresentationTutorialDataset_4081
Name: ID, Length: 4082, dtype: object
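
The identifiers in the output above look like the data set name followed by a zero-padded running index. The following sketch reproduces that pattern. It is inferred from the printed output only and is an assumption, not QSPRpred's actual implementation; `make_ids` is a hypothetical helper.

```python
def make_ids(name, n_entries):
    """Build zero-padded identifiers of the form '<name>_<index>'.

    Assumption: the padding width equals the number of digits in the
    number of entries (4082 entries -> 4 digits), as seen in the output.
    """
    width = len(str(n_entries))
    return [f"{name}_{i:0{width}d}" for i in range(n_entries)]


ids = make_ids("RepresentationTutorialDataset", 4082)
print(ids[0], ids[-1])
```

Padding keeps the identifiers sortable as plain strings, which is convenient when they are used as a data frame index.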

Knowing the identifier, you can select a subset of the data set:

In [7]:
subset = dataset.getSubset(["SMILES", "Year"],
                           ids=["RepresentationTutorialDataset_0000",
                                "RepresentationTutorialDataset_0001"])
subset.getDF()

ID,SMILES,Year,ID
RepresentationTutorialDataset_0000,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,2008.0,RepresentationTutorialDataset_0000
RepresentationTutorialDataset_0001,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,2010.0,RepresentationTutorialDataset_0001


Notice that the subset is actually also a `PandasDataTable` object, so you can perform the same operations on it as on the original data set.

You can also just get values of a single property for certain molecules:

In [8]:
dataset.getProperty("pchembl_value_Mean", ids=["RepresentationTutorialDataset_0000",
                                               "RepresentationTutorialDataset_0001"])

ID
RepresentationTutorialDataset_0000    8.68
RepresentationTutorialDataset_0001    4.82
Name: pchembl_value_Mean, dtype: float64

This goes further: in this particular case, we can also perform simple searches on properties:

In [9]:
subset = dataset.searchOnProperty("Year", [2009, 2010], exact=True)
subset.getDF()

ID,SMILES,pchembl_value_Mean,Year,ID
RepresentationTutorialDataset_0001,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,4.82,2010.0,RepresentationTutorialDataset_0001
RepresentationTutorialDataset_0002,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0,RepresentationTutorialDataset_0002
RepresentationTutorialDataset_0003,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,5.45,2009.0,RepresentationTutorialDataset_0003
RepresentationTutorialDataset_0009,CCCn1c(=O)c2c([nH]c(-c3ccccc3)n2)n(CCCOC)c1=O,6.47,2009.0,RepresentationTutorialDataset_0009
RepresentationTutorialDataset_0018,O=C(Nc1nc(-c2ccccc2)nc2nn(Cc3ccccc3)cc12)c1ccccc1,6.74,2010.0,RepresentationTutorialDataset_0018
...,...,...,...,...
RepresentationTutorialDataset_4049,Nc1nc(-c2ccco2)cc(C(=O)NCc2ccccc2Cl)n1,8.59,2009.0,RepresentationTutorialDataset_4049
RepresentationTutorialDataset_4050,COc1ccccc1-c1cc(C(=O)NCc2ccccn2)nc(N)n1,7.24,2009.0,RepresentationTutorialDataset_4050
RepresentationTutorialDataset_4060,N#Cc1cccc(C(=O)Nc2nc3c(ncc(C(=O)N4CCCCC4)c3)n2...,6.75,2010.0,RepresentationTutorialDataset_4060
RepresentationTutorialDataset_4061,COc1ccc(CCSc2cc3nc(-c4ccco4)nn3c(N)n2)cc1,8.80,2009.0,RepresentationTutorialDataset_4061


You can also perform some operations on the data frame, such as shuffling it (the result is always the same thanks to the fixed random state):

In [10]:
dataset.shuffle()
dataset.getDF()

ID,SMILES,pchembl_value_Mean,Year,ID
RepresentationTutorialDataset_0599,CCCn1c(-c2ccccc2)nc2c1ncnc2NC1CCOC1,5.77,2018.0,RepresentationTutorialDataset_0599
RepresentationTutorialDataset_0752,CCCn1c(=O)c2c([nH]c(-c3c[nH]nc3)n2)n(CCC)c1=O,6.64,2006.0,RepresentationTutorialDataset_0752
RepresentationTutorialDataset_1954,COc1cccc2c1nc(N)n1nc(CN3CCN(c4ncc(F)cc4)CC3C)nc21,7.88,2015.0,RepresentationTutorialDataset_1954
RepresentationTutorialDataset_2928,COc1cccc(CCCC(=O)Nc2nc3c(cccc3)c(=O)s2)c1,6.94,2013.0,RepresentationTutorialDataset_2928
RepresentationTutorialDataset_2512,COc1c2nc(NC(=O)c3ccc(F)cc3)sc2c(N(CCO)C(C)=O)cc1,7.01,2010.0,RepresentationTutorialDataset_2512
...,...,...,...,...
RepresentationTutorialDataset_1130,CCNC(=O)C1OC(n2cnc3c2nc(C#CC2(O)CCCC2)nc3NCC)C...,6.03,2006.0,RepresentationTutorialDataset_1130
RepresentationTutorialDataset_1294,CNC(=O)C1SC(n2cnc3c2nc(Cl)nc3NCc2cc(I)ccc2)C(O...,6.65,2003.0,RepresentationTutorialDataset_1294
RepresentationTutorialDataset_0860,CCNC(=O)C1OC(n2cnc3c(N)nc(N4CCN(c5ccc(OCC(=O)O...,7.28,2015.0,RepresentationTutorialDataset_0860
RepresentationTutorialDataset_3507,CNC(=O)C1[Se]C(n2cnc3c2ncnc3NC2CCC2)C(O)C1O,5.97,2017.0,RepresentationTutorialDataset_3507
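
Why a fixed random state gives a reproducible order can be illustrated with the standard library alone. This is a general sketch of seeded shuffling, not a description of how `PandasDataTable` uses `random_state` internally; `shuffled` and the toy `entry_*` identifiers are made up for the example.

```python
import random

ids = [f"entry_{i}" for i in range(10)]


def shuffled(items, seed):
    """Return a shuffled copy using a private, seeded random generator."""
    rng = random.Random(seed)
    copy = list(items)
    rng.shuffle(copy)
    return copy


first = shuffled(ids, seed=42)
second = shuffled(ids, seed=42)
print(first == second)  # True: same seed, same order
print(sorted(first) == sorted(ids))  # True: shuffling only reorders entries
```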


We can also edit the properties:

In [11]:
# get
year = dataset.getProperty("Year")
display(year)
# drop
dataset.removeProperty("Year")
display(dataset.getProperties())
# set
dataset.addProperty("Year", year)
display(dataset.getProperties())
# set only for some ids
dataset.addProperty("Year", [1990, 1990], ids=dataset.getProperty(dataset.idProp)[:2])
display(dataset.getProperty("Year", ids=dataset.getProperty(dataset.idProp)[:2]))

ID
RepresentationTutorialDataset_0599    2018.0
RepresentationTutorialDataset_0752    2006.0
RepresentationTutorialDataset_1954    2015.0
RepresentationTutorialDataset_2928    2013.0
RepresentationTutorialDataset_2512    2010.0
                                       ...  
RepresentationTutorialDataset_1130    2006.0
RepresentationTutorialDataset_1294    2003.0
RepresentationTutorialDataset_0860    2015.0
RepresentationTutorialDataset_3507    2017.0
RepresentationTutorialDataset_3174    1998.0
Name: Year, Length: 4082, dtype: float64

['SMILES', 'pchembl_value_Mean', 'ID']

['SMILES', 'pchembl_value_Mean', 'ID', 'Year']

ID
RepresentationTutorialDataset_0599    1990.0
RepresentationTutorialDataset_0752    1990.0
Name: Year, dtype: float64

You can easily achieve all of the above by editing the data frame directly, but `pandas` syntax can sometimes be cumbersome, so it is nice to have more intuitive methods available. However, you can always access the underlying data frame if more complex operations are needed and then wrap it back into a `PandasDataTable` object.
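
For comparison, here is a sketch of roughly equivalent plain `pandas` operations on a small toy frame. The values are the first rows shown earlier in this tutorial (SMILES omitted for brevity), while the `mol_*` identifiers are placeholders. The mapping to the `PandasDataTable` methods in the comments is approximate and only meant as orientation.

```python
import pandas as pd

# Toy frame; the index plays the role of the ID property.
df = pd.DataFrame(
    {
        "pchembl_value_Mean": [8.68, 4.82, 5.65],
        "Year": [2008.0, 2010.0, 2009.0],
    },
    index=pd.Index(["mol_0", "mol_1", "mol_2"], name="ID"),
)

# roughly getSubset(["Year"], ids=[...])
subset = df.loc[["mol_0", "mol_1"], ["Year"]]

# roughly searchOnProperty("Year", [2009, 2010], exact=True)
hits = df[df["Year"].isin([2009, 2010])]

# roughly shuffle() with a fixed random state
shuffled = df.sample(frac=1, random_state=42)

# roughly removeProperty("Year") followed by addProperty("Year", year)
year = df["Year"]
df = df.drop(columns=["Year"])
df["Year"] = year

# roughly addProperty("Year", [1990, 1990], ids=[...])
df.loc[["mol_0", "mol_1"], "Year"] = 1990

print(list(hits.index))
print(df["Year"].tolist())
```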

### `TabularStorageBasic` as `ChemStore`

`PandasDataTable` is not very exciting because it does not offer much on top of the `pandas.DataFrame` class. However, it is a good starting point for understanding the `PropertyStorage` API. The `ChemStore` interface is a more advanced version of `PropertyStorage` that is specifically designed for storing and managing chemical data sets. `TabularStorageBasic` implements `ChemStore` using data frames managed by `PandasDataTable` under the hood, but thanks to `ChemStore` it has a few more capabilities:

In [12]:
from qsprpred.data.chem.identifiers import InchiIdentifier
from qsprpred.data.chem.standardizers.papyrus import PapyrusStandardizer
from qsprpred.data.storage.tabular.basic_storage import TabularStorageBasic

df = pd.read_csv("../../tutorial_data/A2A_LIGANDS.tsv", sep="\t")
storage = TabularStorageBasic(
    name="RepresentationTutorialChemStore",
    path="../../tutorial_output/data",
    df=df,
    smiles_col="SMILES",
    standardizer=PapyrusStandardizer(),  # standardizes the SMILES strings
    identifier=InchiIdentifier()  # generates custom identifiers
)
storage

TabularStorageBasic (4082)

As you can see, the code above took a little while to execute. That is because we also performed custom standardization and unique identification of the molecules. In this case, we already have standardized data, but in other cases it might be useful to standardize and identify molecules to find potential duplicates in your data set. In this sense, QSPRpred is also a molecule registration system that you can use to merge data sets from different sources. If you want to speed things up, you can tell `TabularStorageBasic` to run on multiple CPUs as well:

In [13]:
df = pd.read_csv("../../tutorial_data/A2A_LIGANDS.tsv", sep="\t")
storage = TabularStorageBasic(
    name="RepresentationTutorialChemStore",
    path="../../tutorial_output/data",
    df=df,
    smiles_col="SMILES",
    standardizer=PapyrusStandardizer(),  # standardizes the SMILES strings
    identifier=InchiIdentifier(),  # generates custom identifiers
    n_jobs=os.cpu_count()  # use all available CPUs
)
storage

TabularStorageBasic (4082)

If you have multiple cores available, this should have been considerably faster. Easy parallelization is also one feature you get for free with QSPRpred (see [this advanced tutorial to learn more](../../advanced/data/parallelization.ipynb)).
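
Coming back to the molecule registration idea mentioned earlier in this section: once every molecule has a structure-derived identifier, finding duplicates across sources amounts to grouping by that identifier. The sketch below shows only this grouping step with the standard library. The records and the `KEY-*` values are made-up placeholders (not real InChIKeys), and no actual standardization is performed here.

```python
from collections import defaultdict

# Toy records from two hypothetical sources. The "key" field stands in for
# an identifier computed after standardization (e.g. an InChIKey).
source_a = [
    {"key": "KEY-1", "smiles": "CCO", "source": "A"},
    {"key": "KEY-2", "smiles": "c1ccccc1", "source": "A"},
]
source_b = [
    # same molecule as KEY-1 in source A, written as a different SMILES
    {"key": "KEY-1", "smiles": "OCC", "source": "B"},
    {"key": "KEY-3", "smiles": "CC(=O)O", "source": "B"},
]

# Group all records by identifier.
registry = defaultdict(list)
for record in source_a + source_b:
    registry[record["key"]].append(record)

# Identifiers that occur more than once point to duplicate molecules.
duplicates = {key: recs for key, recs in registry.items() if len(recs) > 1}
print(sorted(registry))
print(sorted(duplicates))
```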

Remember that the `TabularStorageBasic` object is also a `PropertyStorage` object, so you can use all the methods and attributes of the `PropertyStorage` API on it:

In [14]:
subset = storage.searchOnProperty("Year", [2009, 2010], exact=True)
subset.getDF()

ID,SMILES,pchembl_value_Mean,Year,original_smiles,ID,ID_before_change
AAEYTMMNWWKSKZ-UHFFFAOYSA-N,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2nc3c(cc12...,4.82,2010.0,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2nc3c(cc12...,AAEYTMMNWWKSKZ-UHFFFAOYSA-N,AAEYTMMNWWKSKZ-UHFFFAOYSA-N
AAGFKZWKWAMJNP-UHFFFAOYSA-N,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,AAGFKZWKWAMJNP-UHFFFAOYSA-N,AAGFKZWKWAMJNP-UHFFFAOYSA-N
AANUKDYJZPKTKN-UHFFFAOYSA-N,CNC(=O)C12CC1C(n1cnc3c(NCc4cccc(Cl)c4)nc(C#CCC...,5.45,2009.0,CNC(=O)C12CC1C(n1cnc3c(NCc4cccc(Cl)c4)nc(C#CCC...,AANUKDYJZPKTKN-UHFFFAOYSA-N,AANUKDYJZPKTKN-UHFFFAOYSA-N
ABIXUHSEHFCQMV-UHFFFAOYSA-N,CCCn1c(=O)c2[nH]c(-c3ccccc3)nc2n(CCCOC)c1=O,6.47,2009.0,CCCn1c(=O)c2[nH]c(-c3ccccc3)nc2n(CCCOC)c1=O,ABIXUHSEHFCQMV-UHFFFAOYSA-N,ABIXUHSEHFCQMV-UHFFFAOYSA-N
ACNFYYUXBQGWQL-UHFFFAOYSA-N,O=C(Nc1nc(-c2ccccc2)nc2nn(Cc3ccccc3)cc12)c1ccccc1,6.74,2010.0,O=C(Nc1nc(-c2ccccc2)nc2nn(Cc3ccccc3)cc12)c1ccccc1,ACNFYYUXBQGWQL-UHFFFAOYSA-N,ACNFYYUXBQGWQL-UHFFFAOYSA-N
...,...,...,...,...,...,...
ZVWNHOGZGKJOCZ-UHFFFAOYSA-N,Nc1nc(C(=O)NCc2ccccc2Cl)cc(-c2ccco2)n1,8.59,2009.0,Nc1nc(C(=O)NCc2ccccc2Cl)cc(-c2ccco2)n1,ZVWNHOGZGKJOCZ-UHFFFAOYSA-N,ZVWNHOGZGKJOCZ-UHFFFAOYSA-N
ZVYYCMRDDCYZAU-UHFFFAOYSA-N,COc1ccccc1-c1cc(C(=O)NCc2ccccn2)nc(N)n1,7.24,2009.0,COc1ccccc1-c1cc(C(=O)NCc2ccccn2)nc(N)n1,ZVYYCMRDDCYZAU-UHFFFAOYSA-N,ZVYYCMRDDCYZAU-UHFFFAOYSA-N
ZWVWCKOJGDHDIG-UHFFFAOYSA-N,N#Cc1cccc(C(=O)Nc2nc3cc(C(=O)N4CCCCC4)cnc3n2C2...,6.75,2010.0,N#Cc1cccc(C(=O)Nc2nc3cc(C(=O)N4CCCCC4)cnc3n2C2...,ZWVWCKOJGDHDIG-UHFFFAOYSA-N,ZWVWCKOJGDHDIG-UHFFFAOYSA-N
ZXCVHJXQJJLILE-UHFFFAOYSA-N,COc1ccc(CCSc2cc3nc(-c4ccco4)nn3c(N)n2)cc1,8.80,2009.0,COc1ccc(CCSc2cc3nc(-c4ccco4)nn3c(N)n2)cc1,ZXCVHJXQJJLILE-UHFFFAOYSA-N,ZXCVHJXQJJLILE-UHFFFAOYSA-N


In addition to what we already explored, `ChemStore` also adds a few more cheminformatics tools that some might appreciate. You can iterate over the storage and get the molecules as `StoredMol` objects, which have their own capabilities:

In [15]:
for mol in storage:
    print(mol)
    print(mol.as_rd_mol())
    print(mol.smiles)
    print(mol.props)
    print(mol.representations)
    break

TabularMol(AACWUFIIMOHGSO-UHFFFAOYSA-N, Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)n1)
<rdkit.Chem.rdchem.Mol object at 0x7efee17be500>
Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)n1
{'ID': 'AACWUFIIMOHGSO-UHFFFAOYSA-N', 'SMILES': 'Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)n1', 'ID_before_change': 'RepresentationTutorialChemStore_library_0000', 'original_smiles': 'Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(C)c1', 'Year': 2008.0, 'pchembl_value_Mean': 8.68}
None


We therefore have all the available information about the molecule at hand, and we can also easily turn it into an RDKit molecule object. Note that the `representations` property is currently empty for the molecules; it would be populated if we had conformers, protomers, tautomers or other representations of the molecule present in the storage. This feature is not implemented yet, but will be soon (feel free to inquire about the status on the [issue tracker](https://github.com/CDDLeiden/QSPRpred/issues) or via [email](https://github.com/CDDLeiden/QSPRpred/blob/main/pyproject.toml)).

You can also iterate over the molecules in chunks:

In [16]:
for chunk in storage.iterChunks(size=2):
    print(chunk)
    break

[<qsprpred.data.storage.tabular.stored_mol.TabularMol object at 0x7eff4dbcf620>, <qsprpred.data.storage.tabular.stored_mol.TabularMol object at 0x7eff4dbcc350>]


This can be useful when processing large data sets one chunk at a time. With a smart implementation of `ChemStore.iterChunks`, the data set does not have to be loaded into memory all at once. The chunks can also be consumed in parallel, which can speed up processing even further (see [this advanced tutorial to learn more](../../advanced/data/parallelization.ipynb)).
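
The lazy behaviour described above can be sketched with a plain Python generator. This is a generic illustration of chunked iteration, not the implementation behind `iterChunks`; `iter_chunks` and `fake_records` are hypothetical helpers, and the latter merely simulates a backend that produces entries one at a time.

```python
from itertools import islice


def iter_chunks(items, size):
    """Lazily yield lists of up to `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def fake_records(n):
    """Hypothetical stand-in for a backend that yields entries one by one."""
    for i in range(n):
        yield f"mol_{i}"


chunks = iter_chunks(fake_records(5), size=2)
print(next(chunks))  # only the first two records have been produced so far
print(list(chunks))  # the remaining chunks; the last one may be shorter
```

Because only one chunk is materialized at a time, memory use is bounded by the chunk size rather than by the size of the data set.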