# Data Representation

In this tutorial, you will learn how data sets are represented in QSPRpred and how you can use the framework to store and prepare data sets not only for QSPR modelling, but for general cheminformatics tasks as well.

## Data Representation API (`PropertyStorage` and `ChemStore`)

### Overview

When designing the storage API, we tried to identify the most common tasks that need to be performed when working with diverse cheminformatics data sets. The main context is QSPR modelling, but the API can also be used to store data from molecular docking or other structure-based simulations. QSPRpred therefore defines a general API to register and store properties (independent variables) for arbitrary data entries in its `PropertyStorage` abstract class. This class is further extended by the `ChemStore` interface, which supports more specific functionality for encoding molecules alongside their properties. The [API documentation](https://cddleiden.github.io/QSPRpred/docs/api/modules.html) of these classes lists the methods and attributes available to interact with them. Anyone can implement their own storage system for compound representations and their properties, and as long as it adheres to the above interfaces, it can be used in QSPRpred seamlessly. This potentially enables more advanced users to interface different storage backends (e.g. SQL databases, NoSQL databases, online REST APIs or prohibitively large data sets) with QSPRpred as well. Since this is more advanced functionality, it is not yet covered in this tutorial, which only focuses on the currently available implementations that store data locally by means of `pandas` data frames. However, we welcome any inquiries about developing clients for custom APIs or databases. Let us know on the [issue tracker](https://github.com/CDDLeiden/QSPRpred/issues) or via [email](https://github.com/CDDLeiden/QSPRpred/blob/main/pyproject.toml).
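
To make the idea of a pluggable backend more concrete, here is a minimal sketch of an abstract storage interface with one in-memory implementation. The class names `ToyPropertyStorage` and `DictStorage` and the helper `mean_property` are hypothetical. The method names are borrowed from calls shown later in this tutorial, but the signatures are assumptions and this is not the actual `PropertyStorage` definition (consult the API documentation for that).

```python
from abc import ABC, abstractmethod


class ToyPropertyStorage(ABC):
    """Hypothetical, reduced stand-in for a property storage interface."""

    @abstractmethod
    def getProperties(self) -> list:
        """Return the names of all stored properties."""

    @abstractmethod
    def getProperty(self, name, ids=None) -> dict:
        """Return values of one property, optionally only for the given IDs."""

    @abstractmethod
    def addProperty(self, name, values) -> None:
        """Add or overwrite values of a property, keyed by entry ID."""


class DictStorage(ToyPropertyStorage):
    """In-memory backend; an SQL or REST backend would implement the same methods."""

    def __init__(self):
        self._data = {}  # property name -> {entry ID: value}

    def getProperties(self):
        return list(self._data)

    def getProperty(self, name, ids=None):
        values = self._data[name]
        if ids is None:
            return dict(values)
        return {i: values[i] for i in ids}

    def addProperty(self, name, values):
        self._data.setdefault(name, {}).update(values)


def mean_property(storage: ToyPropertyStorage, name: str) -> float:
    """Client code depends only on the interface, not on the backend."""
    values = list(storage.getProperty(name).values())
    return sum(values) / len(values)


store = DictStorage()
store.addProperty("Year", {"mol_0": 2008, "mol_1": 2010})
print(store.getProperties())
print(mean_property(store, "Year"))
```

Because `mean_property` only calls methods declared on the abstract class, swapping `DictStorage` for a different backend would not require changing it.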

### `PandasDataTable` as `PropertyStorage`

**Note: Feel free to skip this part of the tutorial and continue to the "`TabularStorageBasic` as `ChemStore`" section if you are more interested in the cheminformatics features of QSPRpred and are not interested in understanding `PropertyStorage` in detail.**

Tabular data is the most common data type in QSPR modelling and `pandas` is the Python package of choice when it comes to processing it. Therefore, we decided to build the default `PropertyStorage` implementation around it and provide a light wrapper for the `pandas.DataFrame` class called `PandasDataTable`. `PandasDataTable` objects simply manage the storage and state of a given `pandas.DataFrame` while giving it all features of the `PropertyStorage` API. You will typically not interact with these objects directly, but we will use one here to demonstrate some of the functions facilitated by the `PropertyStorage` API. We will use the `A2A_LIGANDS.tsv` file from the tutorial data folder as an example data set. This file contains a list of ligands for the adenosine A2A receptor, which is a common target in drug discovery. The data set contains SMILES strings and some other properties relevant for QSPR modelling:

In [1]:
import pandas as pd

df = pd.read_csv("../../tutorial_data/A2A_LIGANDS.tsv", sep="\t")

df.head()

(index),SMILES,pchembl_value_Mean,Year
0,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,8.68,2008.0
1,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,4.82,2010.0
2,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0
3,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,5.45,2009.0
4,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,5.2,2019.0


Wrapping this data frame in a `PandasDataTable` object is simple:

In [2]:
from qsprpred.data.tables.pandas import PandasDataTable
import os

random_state = 42  # for reproducibility of all random operations
os.makedirs("../../tutorial_output/data",
            exist_ok=True)  # create the output directory if it does not exist yet
dataset = PandasDataTable(df=df, store_dir="../../tutorial_output/data",
                          name="RepresentationTutorialDataset",
                          random_state=random_state)
dataset.getDF()

ID (index),SMILES,pchembl_value_Mean,Year,ID
RepresentationTutorialDataset_0000,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,8.68,2008.0,RepresentationTutorialDataset_0000
RepresentationTutorialDataset_0001,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,4.82,2010.0,RepresentationTutorialDataset_0001
RepresentationTutorialDataset_0002,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0,RepresentationTutorialDataset_0002
RepresentationTutorialDataset_0003,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,5.45,2009.0,RepresentationTutorialDataset_0003
RepresentationTutorialDataset_0004,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,5.20,2019.0,RepresentationTutorialDataset_0004
...,...,...,...,...
RepresentationTutorialDataset_4077,CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12,7.09,2018.0,RepresentationTutorialDataset_4077
RepresentationTutorialDataset_4078,Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1,8.22,2008.0,RepresentationTutorialDataset_4078
RepresentationTutorialDataset_4079,Nc1nc(Nc2ccc(F)cc2)nc(CSc2nnc(N)s2)n1,4.89,2010.0,RepresentationTutorialDataset_4079
RepresentationTutorialDataset_4080,CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1,6.51,2013.0,RepresentationTutorialDataset_4080


Since `pandas.DataFrame` is such a popular format, `PropertyStorage` requires all implementations to provide a `getDF` method, which should return all data entries and all properties in the `PropertyStorage` object. This facilitates easy data exchange between QSPRpred and any custom code that relies on `pandas`. However, we can also do a lot with `PandasDataTable` objects directly, for example get the number of entries:

In [3]:
len(dataset)

4082

We can also list the saved properties:

In [4]:
dataset.getProperties()

['SMILES', 'pchembl_value_Mean', 'Year', 'ID']

You will notice that `PandasDataTable` objects automatically create a unique identifier for each data entry. The name of the property that holds it is available as `idProp`. QSPRpred uses these identifiers internally to keep track of data entries and to select relevant subsets. You can access them as follows:

In [5]:
dataset.idProp

'ID'

In [6]:
dataset.getProperty(dataset.idProp)

ID
RepresentationTutorialDataset_0000    RepresentationTutorialDataset_0000
RepresentationTutorialDataset_0001    RepresentationTutorialDataset_0001
RepresentationTutorialDataset_0002    RepresentationTutorialDataset_0002
RepresentationTutorialDataset_0003    RepresentationTutorialDataset_0003
RepresentationTutorialDataset_0004    RepresentationTutorialDataset_0004
                                                     ...                
RepresentationTutorialDataset_4077    RepresentationTutorialDataset_4077
RepresentationTutorialDataset_4078    RepresentationTutorialDataset_4078
RepresentationTutorialDataset_4079    RepresentationTutorialDataset_4079
RepresentationTutorialDataset_4080    RepresentationTutorialDataset_4080
RepresentationTutorialDataset_4081    RepresentationTutorialDataset_4081
Name: ID, Length: 4082, dtype: object
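
The identifiers in the output above follow a `<name>_<zero-padded index>` pattern. A minimal sketch of how such identifiers could be generated is shown below. The helper `make_ids` is hypothetical, and deriving the padding width from the number of entries is an assumption based only on the displayed output, not on confirmed QSPRpred behaviour.

```python
def make_ids(name: str, n_entries: int) -> list:
    """Generate IDs such as 'MyData_0000' (hypothetical helper)."""
    # Assumption: pad indices to the number of digits in the entry count.
    width = len(str(n_entries))
    return [f"{name}_{i:0{width}d}" for i in range(n_entries)]


ids = make_ids("RepresentationTutorialDataset", 4082)
print(ids[0])
print(ids[-1])
print(len(set(ids)) == len(ids))  # every entry gets a unique ID
```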

Knowing the identifier, you can select a subset of the data set:

In [7]:
subset = dataset.getSubset(["SMILES", "Year"],
                           ids=["RepresentationTutorialDataset_0000",
                                "RepresentationTutorialDataset_0001"])
subset.getDF()

ID (index),SMILES,Year,ID
RepresentationTutorialDataset_0000,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,2008.0,RepresentationTutorialDataset_0000
RepresentationTutorialDataset_0001,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,2010.0,RepresentationTutorialDataset_0001


Notice that the subset is actually also a `PandasDataTable` object, so you can perform the same operations on it as on the original data set. 

You can also just get values of a single property for certain molecules:

In [8]:
dataset.getProperty("pchembl_value_Mean", ids=["RepresentationTutorialDataset_0000",
                                               "RepresentationTutorialDataset_0001"])

ID
RepresentationTutorialDataset_0000    8.68
RepresentationTutorialDataset_0001    4.82
Name: pchembl_value_Mean, dtype: float64

The API goes further than that. In this particular case, we can also perform simple searches on properties:

In [9]:
subset = dataset.searchOnProperty("Year", [2009, 2010], exact=True)
subset.getDF()

ID (index),SMILES,pchembl_value_Mean,Year,ID
RepresentationTutorialDataset_0001,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,4.82,2010.0,RepresentationTutorialDataset_0001
RepresentationTutorialDataset_0002,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0,RepresentationTutorialDataset_0002
RepresentationTutorialDataset_0003,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,5.45,2009.0,RepresentationTutorialDataset_0003
RepresentationTutorialDataset_0009,CCCn1c(=O)c2c([nH]c(-c3ccccc3)n2)n(CCCOC)c1=O,6.47,2009.0,RepresentationTutorialDataset_0009
RepresentationTutorialDataset_0018,O=C(Nc1nc(-c2ccccc2)nc2nn(Cc3ccccc3)cc12)c1ccccc1,6.74,2010.0,RepresentationTutorialDataset_0018
...,...,...,...,...
RepresentationTutorialDataset_4049,Nc1nc(-c2ccco2)cc(C(=O)NCc2ccccc2Cl)n1,8.59,2009.0,RepresentationTutorialDataset_4049
RepresentationTutorialDataset_4050,COc1ccccc1-c1cc(C(=O)NCc2ccccn2)nc(N)n1,7.24,2009.0,RepresentationTutorialDataset_4050
RepresentationTutorialDataset_4060,N#Cc1cccc(C(=O)Nc2nc3c(ncc(C(=O)N4CCCCC4)c3)n2...,6.75,2010.0,RepresentationTutorialDataset_4060
RepresentationTutorialDataset_4061,COc1ccc(CCSc2cc3nc(-c4ccco4)nn3c(N)n2)cc1,8.80,2009.0,RepresentationTutorialDataset_4061


You can also perform some operations on the data frame, such as shuffling it (the result is always the same thanks to the fixed random state):

In [10]:
dataset.shuffle()
dataset.getDF()

ID (index),SMILES,pchembl_value_Mean,Year,ID
RepresentationTutorialDataset_0599,CCCn1c(-c2ccccc2)nc2c1ncnc2NC1CCOC1,5.77,2018.0,RepresentationTutorialDataset_0599
RepresentationTutorialDataset_0752,CCCn1c(=O)c2c([nH]c(-c3c[nH]nc3)n2)n(CCC)c1=O,6.64,2006.0,RepresentationTutorialDataset_0752
RepresentationTutorialDataset_1954,COc1cccc2c1nc(N)n1nc(CN3CCN(c4ncc(F)cc4)CC3C)nc21,7.88,2015.0,RepresentationTutorialDataset_1954
RepresentationTutorialDataset_2928,COc1cccc(CCCC(=O)Nc2nc3c(cccc3)c(=O)s2)c1,6.94,2013.0,RepresentationTutorialDataset_2928
RepresentationTutorialDataset_2512,COc1c2nc(NC(=O)c3ccc(F)cc3)sc2c(N(CCO)C(C)=O)cc1,7.01,2010.0,RepresentationTutorialDataset_2512
...,...,...,...,...
RepresentationTutorialDataset_1130,CCNC(=O)C1OC(n2cnc3c2nc(C#CC2(O)CCCC2)nc3NCC)C...,6.03,2006.0,RepresentationTutorialDataset_1130
RepresentationTutorialDataset_1294,CNC(=O)C1SC(n2cnc3c2nc(Cl)nc3NCc2cc(I)ccc2)C(O...,6.65,2003.0,RepresentationTutorialDataset_1294
RepresentationTutorialDataset_0860,CCNC(=O)C1OC(n2cnc3c(N)nc(N4CCN(c5ccc(OCC(=O)O...,7.28,2015.0,RepresentationTutorialDataset_0860
RepresentationTutorialDataset_3507,CNC(=O)C1[Se]C(n2cnc3c2ncnc3NC2CCC2)C(O)C1O,5.97,2017.0,RepresentationTutorialDataset_3507
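
The reproducibility of the shuffle comes from seeding the random number generator. The sketch below illustrates the principle with the Python standard library only. The helper `seeded_shuffle` is hypothetical and it will not reproduce the order shown above, because the internal shuffling implementation of QSPRpred is not part of this tutorial.

```python
import random


def seeded_shuffle(items: list, seed: int) -> list:
    """Return a shuffled copy; the same seed always gives the same order."""
    rng = random.Random(seed)  # private generator, global state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled


ids = [f"mol_{i:04d}" for i in range(10)]
first = seeded_shuffle(ids, seed=42)
second = seeded_shuffle(ids, seed=42)
print(first == second)       # identical order for identical seeds
print(sorted(first) == ids)  # shuffling only reorders, nothing is lost
```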


We can also edit the properties:

In [11]:
# get
year = dataset.getProperty("Year")
display(year)
# drop
dataset.removeProperty("Year")
display(dataset.getProperties())
# set
dataset.addProperty("Year", year)
display(dataset.getProperties())
# set only for some ids
dataset.addProperty("Year", [1990, 1990], ids=dataset.getProperty(dataset.idProp)[:2])
display(dataset.getProperty("Year", ids=dataset.getProperty(dataset.idProp)[:2]))

ID
RepresentationTutorialDataset_0599    2018.0
RepresentationTutorialDataset_0752    2006.0
RepresentationTutorialDataset_1954    2015.0
RepresentationTutorialDataset_2928    2013.0
RepresentationTutorialDataset_2512    2010.0
                                       ...  
RepresentationTutorialDataset_1130    2006.0
RepresentationTutorialDataset_1294    2003.0
RepresentationTutorialDataset_0860    2015.0
RepresentationTutorialDataset_3507    2017.0
RepresentationTutorialDataset_3174    1998.0
Name: Year, Length: 4082, dtype: float64

['SMILES', 'pchembl_value_Mean', 'ID']

['SMILES', 'pchembl_value_Mean', 'ID', 'Year']

ID
RepresentationTutorialDataset_0599    1990.0
RepresentationTutorialDataset_0752    1990.0
Name: Year, dtype: float64

You can easily achieve all of those by editing the data frame directly, but `pandas` syntax can sometimes be cumbersome, so it is nice to have more intuitive methods available. However, you can always access the underlying data frame if more complex operations are needed and then wrap it back into a `PandasDataTable` object.
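
For comparison, here is a sketch of the same get, drop, set and search steps written directly in `pandas`. It uses a small stand-in data frame with made-up values rather than the tutorial data, and treating an exact search as a membership test with `isin` is an assumption about what such a search does.

```python
import pandas as pd

# Small stand-in frame indexed by ID (values are made up for illustration).
df = pd.DataFrame(
    {"SMILES": ["CCO", "c1ccccc1", "CC(=O)O"], "Year": [2008.0, 2010.0, 2009.0]},
    index=pd.Index(["mol_0", "mol_1", "mol_2"], name="ID"),
)

# get
year = df["Year"]
# drop
df = df.drop(columns=["Year"])
# set
df["Year"] = year
# set only for some ids
df.loc[["mol_0", "mol_1"], "Year"] = 1990.0
# search for exact matches in a property
hits = df[df["Year"].isin([2009, 2010])]

print(df)
print(hits.index.tolist())
```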

### `TabularStorageBasic` as `ChemStore`

`PandasDataTable` is not very exciting because it does not offer much on top of the `pandas.DataFrame` class. However, it is a good starting point to understand the `PropertyStorage` API. The `ChemStore` interface is a more advanced version of `PropertyStorage` that is specifically designed for storing and managing chemical data sets. `TabularStorageBasic` implements `ChemStore` using data frames managed by `PandasDataTable` under the hood, but thanks to `ChemStore` it has a few more capabilities:

In [12]:
from qsprpred.data.chem.identifiers import InchiIdentifier
from qsprpred.data.chem.standardizers.papyrus import PapyrusStandardizer
from qsprpred.data.storage.tabular.basic_storage import TabularStorageBasic

df = pd.read_csv("../../tutorial_data/A2A_LIGANDS.tsv", sep="\t")
storage = TabularStorageBasic(
    name="RepresentationTutorialChemStore",
    path="../../tutorial_output/data",
    df=df,
    smiles_col="SMILES",
    standardizer=PapyrusStandardizer(),  # standardizes the SMILES strings
    identifier=InchiIdentifier()  # generates custom identifiers
)
storage

TabularStorageBasic (4082)

As you can see, the code above took a little while to execute. That is because we also performed custom standardization and unique identification of the molecules. In this case, we already have standardized data, but in other cases it might be useful to standardize and identify molecules to find potential duplicates in your data set. In this sense, QSPRpred is also a molecule registration system that you can use to merge data sets from different sources. If you want to speed things up, you can tell `TabularStorageBasic` to run on multiple CPUs as well:

In [13]:
df = pd.read_csv("../../tutorial_data/A2A_LIGANDS.tsv", sep="\t")
storage = TabularStorageBasic(
    name="RepresentationTutorialChemStore",
    path="../../tutorial_output/data",
    df=df,
    smiles_col="SMILES",
    standardizer=PapyrusStandardizer(),  # standardizes the SMILES strings
    identifier=InchiIdentifier(),  # generates custom identifiers
    n_jobs=os.cpu_count()  # use all available CPUs
)
storage

TabularStorageBasic (4082)

If you have multiple cores available, this should have been considerably faster. Easy parallelization is also one feature you get for free with QSPRpred (see [this advanced tutorial to learn more](../../advanced/data/parallelization.ipynb)).
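
The registration idea mentioned earlier, merging data from different sources under a structure-derived identifier, could be sketched as follows. Everything here is hypothetical: `toy_identifier` only looks up hand-written placeholder keys, whereas a real identifier such as `InchiIdentifier` would compute the key from the structure itself.

```python
# Placeholder keys for illustration only; a real identifier would compute
# a structure-derived key (e.g. an InChIKey) instead of using a lookup table.
PLACEHOLDER_KEYS = {
    "CCO": "key-ethanol",
    "OCC": "key-ethanol",  # different SMILES string, same molecule
    "c1ccccc1": "key-benzene",
}


def toy_identifier(smiles: str) -> str:
    """Hypothetical identifier: maps a SMILES string to a molecule key."""
    return PLACEHOLDER_KEYS[smiles]


def register(sources: list) -> dict:
    """Group records from several sources under one key per molecule."""
    registry = {}
    for records in sources:
        for record in records:
            key = toy_identifier(record["SMILES"])
            registry.setdefault(key, []).append(record)
    return registry


source_a = [{"SMILES": "CCO", "origin": "A"}, {"SMILES": "c1ccccc1", "origin": "A"}]
source_b = [{"SMILES": "OCC", "origin": "B"}]
registry = register([source_a, source_b])

# Keys with more than one record point to duplicates across the sources.
duplicates = {key: recs for key, recs in registry.items() if len(recs) > 1}
print(sorted(registry))
print(list(duplicates))
```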

Remember that the `TabularStorageBasic` object is also a `PropertyStorage` object, so you can use all the methods and attributes of the `PropertyStorage` API on it:

In [14]:
subset = storage.searchOnProperty("Year", [2009, 2010], exact=True)
subset.getDF()

ID (index),SMILES,pchembl_value_Mean,Year,original_smiles,ID,ID_before_change
AAEYTMMNWWKSKZ-UHFFFAOYSA-N,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2nc3c(cc12...,4.82,2010.0,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2nc3c(cc12...,AAEYTMMNWWKSKZ-UHFFFAOYSA-N,AAEYTMMNWWKSKZ-UHFFFAOYSA-N
AAGFKZWKWAMJNP-UHFFFAOYSA-N,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,AAGFKZWKWAMJNP-UHFFFAOYSA-N,AAGFKZWKWAMJNP-UHFFFAOYSA-N
AANUKDYJZPKTKN-UHFFFAOYSA-N,CNC(=O)C12CC1C(n1cnc3c(NCc4cccc(Cl)c4)nc(C#CCC...,5.45,2009.0,CNC(=O)C12CC1C(n1cnc3c(NCc4cccc(Cl)c4)nc(C#CCC...,AANUKDYJZPKTKN-UHFFFAOYSA-N,AANUKDYJZPKTKN-UHFFFAOYSA-N
ABIXUHSEHFCQMV-UHFFFAOYSA-N,CCCn1c(=O)c2[nH]c(-c3ccccc3)nc2n(CCCOC)c1=O,6.47,2009.0,CCCn1c(=O)c2[nH]c(-c3ccccc3)nc2n(CCCOC)c1=O,ABIXUHSEHFCQMV-UHFFFAOYSA-N,ABIXUHSEHFCQMV-UHFFFAOYSA-N
ACNFYYUXBQGWQL-UHFFFAOYSA-N,O=C(Nc1nc(-c2ccccc2)nc2nn(Cc3ccccc3)cc12)c1ccccc1,6.74,2010.0,O=C(Nc1nc(-c2ccccc2)nc2nn(Cc3ccccc3)cc12)c1ccccc1,ACNFYYUXBQGWQL-UHFFFAOYSA-N,ACNFYYUXBQGWQL-UHFFFAOYSA-N
...,...,...,...,...,...,...
ZVWNHOGZGKJOCZ-UHFFFAOYSA-N,Nc1nc(C(=O)NCc2ccccc2Cl)cc(-c2ccco2)n1,8.59,2009.0,Nc1nc(C(=O)NCc2ccccc2Cl)cc(-c2ccco2)n1,ZVWNHOGZGKJOCZ-UHFFFAOYSA-N,ZVWNHOGZGKJOCZ-UHFFFAOYSA-N
ZVYYCMRDDCYZAU-UHFFFAOYSA-N,COc1ccccc1-c1cc(C(=O)NCc2ccccn2)nc(N)n1,7.24,2009.0,COc1ccccc1-c1cc(C(=O)NCc2ccccn2)nc(N)n1,ZVYYCMRDDCYZAU-UHFFFAOYSA-N,ZVYYCMRDDCYZAU-UHFFFAOYSA-N
ZWVWCKOJGDHDIG-UHFFFAOYSA-N,N#Cc1cccc(C(=O)Nc2nc3cc(C(=O)N4CCCCC4)cnc3n2C2...,6.75,2010.0,N#Cc1cccc(C(=O)Nc2nc3cc(C(=O)N4CCCCC4)cnc3n2C2...,ZWVWCKOJGDHDIG-UHFFFAOYSA-N,ZWVWCKOJGDHDIG-UHFFFAOYSA-N
ZXCVHJXQJJLILE-UHFFFAOYSA-N,COc1ccc(CCSc2cc3nc(-c4ccco4)nn3c(N)n2)cc1,8.80,2009.0,COc1ccc(CCSc2cc3nc(-c4ccco4)nn3c(N)n2)cc1,ZXCVHJXQJJLILE-UHFFFAOYSA-N,ZXCVHJXQJJLILE-UHFFFAOYSA-N


In addition to what we already explored, `ChemStore` also adds a few more cheminformatics tools that some might appreciate. You can iterate over the storage and get the molecules as `StoredMol` objects, which have their own capabilities:

In [15]:
for mol in storage:
    print(mol)
    print(mol.as_rd_mol())
    print(mol.smiles)
    print(mol.props)
    print(mol.representations)
    break

TabularMol(AACWUFIIMOHGSO-UHFFFAOYSA-N, Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)n1)
<rdkit.Chem.rdchem.Mol object at 0x7f92ca9b8820>
Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)n1
{'Year': 2008.0, 'original_smiles': 'Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(C)c1', 'ID_before_change': 'RepresentationTutorialChemStore_library_0000', 'ID': 'AACWUFIIMOHGSO-UHFFFAOYSA-N', 'SMILES': 'Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)n1', 'pchembl_value_Mean': 8.68}
None


Therefore, we have all the information about the molecule we can get, and we can also easily turn it into an RDKit molecule object. Note that the `representations` property is currently empty for the molecules. It would be populated if we had conformers, protomers, tautomers or other representations of the molecule present in the storage. This feature is not implemented yet, but will be soon (feel free to inquire about the status on the [issue tracker](https://github.com/CDDLeiden/QSPRpred/issues) or via [email](https://github.com/CDDLeiden/QSPRpred/blob/main/pyproject.toml)).

You can also iterate over the molecules in chunks:

In [16]:
for chunk in storage.iterChunks(size=2):
    print(chunk)
    break

[<qsprpred.data.storage.tabular.stored_mol.TabularMol object at 0x7f9337fc5f10>, <qsprpred.data.storage.tabular.stored_mol.TabularMol object at 0x7f9337fc6780>]


This can be useful when processing large data sets one chunk at a time and with a smart implementation of `ChemStore.iterChunks` the data set does not have to be loaded into memory all at once. The chunks can also be consumed in parallel, which can speed up processing even further (see [this advanced tutorial to learn more](../../advanced/data/parallelization.ipynb)).
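
To illustrate why chunked iteration avoids loading everything at once, here is a minimal sketch of a lazy chunk generator built on `itertools.islice`. The functions `iter_chunks` and `fake_molecule_stream` are hypothetical helpers and not how `ChemStore.iterChunks` is implemented.

```python
from itertools import islice
from typing import Iterable, Iterator


def iter_chunks(items: Iterable, size: int) -> Iterator[list]:
    """Yield lists of at most `size` items without materializing the input."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def fake_molecule_stream(n: int) -> Iterator[str]:
    """Stand-in for a large source that produces entries one at a time."""
    for i in range(n):
        yield f"mol_{i}"


chunks = list(iter_chunks(fake_molecule_stream(5), size=2))
print(chunks)
```

Only one chunk is held in memory at a time while iterating, and the last chunk may be shorter than `size`.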

### `MoleculeTable` and `QSPRDataset`

Now that we know a bit about how QSPRpred stores molecules, we can take a look at the data structures you know from [the quick start](../../quick_start.ipynb) and how they are implemented. The `MoleculeTable` and `QSPRDataset` classes are specific for QSPR modelling tasks and implement a selection of interfaces for this purpose. Check out entries for `MoleculeDataSet` and `QSPRDataSet` abstract classes in the [API documentation](https://cddleiden.github.io/QSPRpred/docs/api/modules.html) to see what they offer. The main thing to remember for this tutorial, however, is that `MoleculeTable` adds the ability to add and store molecular descriptors and `QSPRDataset` is its subclass, which adds the ability to store information about target properties and modelling tasks. We can initialize them from `ChemStore` instances quite easily:

In [17]:
from qsprpred.data import MoleculeTable

mt = MoleculeTable(
    storage,  # ChemStore object 
    name="RepresentationTutorialMoleculeTable",
    path="../../tutorial_output/data",
    # where the molecule table associated data will live
)
mt.getDF()

ID (index),SMILES,pchembl_value_Mean,Year,original_smiles,ID,ID_before_change
AACWUFIIMOHGSO-UHFFFAOYSA-N,Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n...,8.68,2008.0,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,AACWUFIIMOHGSO-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_0000
AAEYTMMNWWKSKZ-UHFFFAOYSA-N,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2nc3c(cc12...,4.82,2010.0,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,AAEYTMMNWWKSKZ-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_0001
AAGFKZWKWAMJNP-UHFFFAOYSA-N,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,AAGFKZWKWAMJNP-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_0002
AANUKDYJZPKTKN-UHFFFAOYSA-N,CNC(=O)C12CC1C(n1cnc3c(NCc4cccc(Cl)c4)nc(C#CCC...,5.45,2009.0,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,AANUKDYJZPKTKN-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_0003
AASXHCGIIQCKEE-UHFFFAOYSA-N,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,5.20,2019.0,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,AASXHCGIIQCKEE-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_0004
...,...,...,...,...,...,...
ZYXGKENMDDPQIE-UHFFFAOYSA-N,CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12,7.09,2018.0,CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12,ZYXGKENMDDPQIE-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_4077
ZYZWFDVXMLCIOU-UHFFFAOYSA-N,Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1,8.22,2008.0,Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1,ZYZWFDVXMLCIOU-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_4078
ZZBZWSYDXUPJCT-UHFFFAOYSA-N,Nc1nc(CSc2nnc(N)s2)nc(Nc2ccc(F)cc2)n1,4.89,2010.0,Nc1nc(Nc2ccc(F)cc2)nc(CSc2nnc(N)s2)n1,ZZBZWSYDXUPJCT-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_4079
ZZMIPZLRKFEGIA-UHFFFAOYSA-N,CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1,6.51,2013.0,CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1,ZZMIPZLRKFEGIA-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_4080


Again, since this is also a `PropertyStorage` object, you can use all the methods and attributes of the `PropertyStorage` API on it. It also exposes much of the functionality of the underlying storage:

In [18]:
for mol in mt:
    print(mol)
    print(mol.as_rd_mol())
    print(mol.smiles)
    print(mol.props)
    print(mol.representations)
    break

TabularMol(AACWUFIIMOHGSO-UHFFFAOYSA-N, Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)n1)
<rdkit.Chem.rdchem.Mol object at 0x7f92ca9b8820>
Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)n1
{'Year': 2008.0, 'original_smiles': 'Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(C)c1', 'ID_before_change': 'RepresentationTutorialChemStore_library_0000', 'ID': 'AACWUFIIMOHGSO-UHFFFAOYSA-N', 'SMILES': 'Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)n1', 'pchembl_value_Mean': 8.68}
None


Note that `ChemStore` objects are also subscriptable, which is also true for `MoleculeTable` objects:

In [19]:
mt['AACWUFIIMOHGSO-UHFFFAOYSA-N'].props

{'SMILES': 'Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)n1',
 'pchembl_value_Mean': 8.68,
 'Year': 2008.0,
 'original_smiles': 'Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(C)c1',
 'ID': 'AACWUFIIMOHGSO-UHFFFAOYSA-N',
 'ID_before_change': 'RepresentationTutorialChemStore_library_0000'}

`QSPRDataset` is a subclass of `MoleculeTable`, which requires target properties to be defined in addition to the underlying `ChemStore` object:

In [20]:
from qsprpred import TargetTasks, TargetProperty

from qsprpred.data import QSPRDataset

dataset = QSPRDataset(
    storage,  # ChemStore object
    name="RepresentationTutorialDataset",
    path="../../tutorial_output/data",
    target_props=[TargetProperty("pchembl_value_Mean", TargetTasks.REGRESSION)]
)
dataset.targetProperties

[TargetProperty(name=pchembl_value_Mean, task=REGRESSION)]

You can also create it by converting a `MoleculeTable` object:

In [21]:
dataset = QSPRDataset.fromMolTable(mt, target_props=[
    TargetProperty("pchembl_value_Mean", TargetTasks.REGRESSION)
])
dataset.targetProperties

[TargetProperty(name=pchembl_value_Mean, task=REGRESSION)]

Alternatively, you can go directly from a data frame, which will create the underlying `ChemStore` object for you:

In [22]:
dataset = QSPRDataset.fromDF(
    name="RepresentationTutorialDataset",
    df=df,
    path="../../tutorial_output/data",
    target_props=[TargetProperty("pchembl_value_Mean", TargetTasks.REGRESSION)],
    smiles_col="SMILES"
)
dataset.targetProperties

[TargetProperty(name=pchembl_value_Mean, task=REGRESSION)]

In [23]:
dataset.storage

TabularStorageBasic (3286)

### Saving and Loading

The data structures in QSPRpred are also designed to be easily saved and reloaded from files to persist changes. We can easily save the data set to a file like this:

In [24]:
dataset.save()

This will save the data set into a folder we specified upon creation:

In [25]:
dataset.path

'/home/sichom/projects/QSPRpred/tutorials/tutorial_output/data/RepresentationTutorialDataset'

It will also update or save the underlying `ChemStore` object, which in this case lives in its own folder under the same parent directory:

In [26]:
dataset.storage.path

'/home/sichom/projects/QSPRpred/tutorials/tutorial_output/data/RepresentationTutorialDataset_storage'

Therefore, storages and data sets can live in different folders and can be shared between projects. That means you can use the same storage for both your QSPR modelling and your docking project, for example. Both projects will have access to all data in your storage even if it changes over time, which can be useful for data management. 

Reloading the data set is easy as well. Every `PropertyStorage` gets a `fromFile` method that can be used to reload the instance from a saved snapshot:

In [27]:
dataset = QSPRDataset.fromFile(
    f"{dataset.path}/meta.json"
)
dataset.targetProperties

[TargetProperty(name=pchembl_value_Mean, task=REGRESSION)]
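
The separation between a data set and its storage can be sketched with plain JSON files. This is only an illustration of the idea: the functions `save_dataset` and `load_dataset`, the keys written to the file and the folder names are all hypothetical and do not reflect the actual contents of the `meta.json` written by QSPRpred.

```python
import json
import tempfile
from pathlib import Path


def save_dataset(folder: Path, name: str, storage_path: Path, target: str) -> Path:
    """Write a small metadata file that points to a separately kept storage."""
    folder.mkdir(parents=True, exist_ok=True)
    meta_file = folder / "meta.json"
    meta = {"name": name, "storage_path": str(storage_path), "target": target}
    meta_file.write_text(json.dumps(meta, indent=2))
    return meta_file


def load_dataset(meta_file: Path) -> dict:
    """Restore the metadata; the storage is located through its saved path."""
    return json.loads(meta_file.read_text())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    storage_dir = root / "shared_storage"  # one storage used by two projects
    storage_dir.mkdir()
    meta_a = save_dataset(root / "qspr_project", "A", storage_dir, "pchembl_value_Mean")
    meta_b = save_dataset(root / "docking_project", "B", storage_dir, "docking_score")
    loaded_a = load_dataset(meta_a)
    loaded_b = load_dataset(meta_b)

print(loaded_a["storage_path"] == loaded_b["storage_path"])
print(loaded_a["target"])
```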

### Intermezzo on Molecule Standardization

Before doing any calculations, it is a good idea to standardize structures and drop invalid molecules, which is handled by the storage object itself. However, you can always override the standardizer associated with a `ChemStore` object or a `MoleculeTable` object. We can even write our own standardizer and use it to standardize the molecules before we do any calculations:

In [28]:
len(dataset)  # original length

3286

In [29]:
from qsprpred.data.chem.standardizers.base import ChemStandardizer


class MyStandardizer(ChemStandardizer):
    """A silly example standardizer that removes all molecules with halogens in them."""

    def __init__(self, halogens=None):
        self.halogens = halogens or ["F", "Cl", "Br", "I"]

    def convert_smiles(self, smiles) -> tuple[str | None, str]:
        """Discards all molecules with halogens in them.
        
        Returns:
            tuple[str | None, str]: 
                The first element is the standardized smiles, the second is the original.
                If the smiles should be discarded, return None as the first element.
        """
        # Naive substring check: "F" would also match e.g. "[Fe]" in a SMILES string.
        for halogen in self.halogens:
            if halogen in smiles:
                return None, smiles  # return None to discard
        return smiles, smiles

    @property
    def settings(self):
        """Used to return the settings of the standardizer."""
        return {"halogens": self.halogens}

    def get_id(self):
        return ",".join(sorted(self.halogens))

    @classmethod
    def from_settings(cls, settings: dict):
        return cls(**settings)


dataset.applyStandardizer(
    MyStandardizer(["Br", "F", "I"])
)  # remove all molecules with bromine, fluorine or iodine in them
len(dataset)  # reduced length

3286

You can see that you are required to also implement a few more things than just the `convert_smiles` method. This is because standardizers should be explicit about their settings and it should be possible to compare them. This will help you find out if two storages or data sets are compatible with each other or if you need to unify the standardization process between them:

In [30]:
dataset.storage.standardizer.get_id()

'Br,F,I'

The standardizers used are saved with the storage so you can always retrieve them and check how the data was standardized:

In [31]:
dataset.save()
dataset = QSPRDataset.fromFile(
    f"{dataset.path}/meta.json"
)
dataset.storage.standardizer.settings

{'halogens': ['Br', 'F', 'I']}

In [32]:
dataset.storage.standardizer.get_id()

'Br,F,I'
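
The comparison of standardizers can be sketched without QSPRpred at all. The class below is a standalone stand-in that mirrors the `settings`, `get_id` and `from_settings` pattern of `MyStandardizer`. The `compatible` helper is hypothetical, and treating equal IDs as a sign of compatibility is an assumption about how such a check could be done.

```python
class ToyStandardizer:
    """Standalone stand-in that mirrors the settings/ID pattern shown above."""

    def __init__(self, halogens):
        self.halogens = list(halogens)

    @property
    def settings(self) -> dict:
        return {"halogens": self.halogens}

    def get_id(self) -> str:
        # Sorting makes the ID independent of the order of the settings.
        return ",".join(sorted(self.halogens))

    @classmethod
    def from_settings(cls, settings: dict) -> "ToyStandardizer":
        return cls(**settings)


def compatible(a: ToyStandardizer, b: ToyStandardizer) -> bool:
    """Treat two standardizers as compatible when their IDs match."""
    return a.get_id() == b.get_id()


first = ToyStandardizer(["Br", "F", "I"])
second = ToyStandardizer(["I", "Br", "F"])  # same settings, different order
third = ToyStandardizer(["Cl"])
restored = ToyStandardizer.from_settings(first.settings)

print(compatible(first, second))
print(compatible(first, third))
print(restored.get_id())
```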

### Calculating Molecular Descriptors

Once you have settled on your preferred data structure and standardized your data set, you can start calculating descriptors. The package already contains many descriptor implementations, but you can also easily add your own. We encourage you to check out the [descriptor tutorial](descriptors.ipynb) to learn more, but for the sake of completeness here is a simple example with Morgan fingerprints and RDKit descriptors:

In [33]:
from qsprpred.data.descriptors.fingerprints import MorganFP
from qsprpred.data.descriptors.sets import RDKitDescs

dataset.addDescriptors([MorganFP(radius=3, nBits=2048), RDKitDescs()])

Notice that since we are using the `TabularStorageBasic` as `ChemStore`, we can also speed these calculations up with parallelization, which is also covered in the [advanced tutorials](../../advanced/data/parallelization.ipynb):

In [34]:
dataset.storage.nJobs = os.cpu_count()

In [35]:
dataset.addDescriptors([MorganFP(radius=3, nBits=2048), RDKitDescs()], recalculate=True)



Descriptors are kept in their own wrapped tables, which can be accessed with the `descriptors` attribute:

In [36]:
dataset.descriptors

[DescriptorTable (3286), DescriptorTable (3286)]

These are simply specialized implementations of `PandasDataTable`, so you can use all the methods and attributes of the `PropertyStorage` API on them as well:

In [37]:
dataset.descriptors[1].getDF()

ID (index),RDkit_AvgIpc,RDkit_BCUT2D_CHGHI,RDkit_BCUT2D_CHGLO,RDkit_BCUT2D_LOGPHI,RDkit_BCUT2D_LOGPLOW,RDkit_BCUT2D_MRHI,RDkit_BCUT2D_MRLOW,RDkit_BCUT2D_MWHI,RDkit_BCUT2D_MWLOW,RDkit_BalabanJ,...,RDkit_fr_sulfone,RDkit_fr_term_acetylene,RDkit_fr_tetrazole,RDkit_fr_thiazole,RDkit_fr_thiocyan,RDkit_fr_thiophene,RDkit_fr_unbrch_alkane,RDkit_fr_urea,RDkit_qed,ID
CJFKDWZUTAKCKB-UHFFFAOYSA-N,3.369270,2.205644,-2.130499,2.304950,-2.258172,6.030517,0.647784,16.142281,10.071012,1.697764,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.371586,CJFKDWZUTAKCKB-UHFFFAOYSA-N
CJMZMTPUYHISPT-UHFFFAOYSA-N,3.324433,2.151936,-2.101078,2.227721,-2.165838,7.915207,-0.115145,32.233101,10.125881,1.583207,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.414226,CJMZMTPUYHISPT-UHFFFAOYSA-N
CJRNHKSLHHWUAB-UHFFFAOYSA-N,3.531243,2.190624,-2.082828,2.241043,-2.217917,5.998538,0.261656,16.465353,10.281546,1.526925,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.418088,CJRNHKSLHHWUAB-UHFFFAOYSA-N
CJUIJRFDRPPWHP-UHFFFAOYSA-N,3.249800,2.435803,-2.162295,2.317780,-2.326434,6.313636,-0.051059,35.495697,10.128447,1.626467,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.205431,CJUIJRFDRPPWHP-UHFFFAOYSA-N
CJWPDQKJIQQPSQ-UHFFFAOYSA-N,3.460251,2.191384,-2.106237,2.272459,-2.238651,5.994110,0.095132,16.507875,10.161134,1.615859,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.478837,CJWPDQKJIQQPSQ-UHFFFAOYSA-N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
FBMQNRKSAWNXBT-UHFFFAOYSA-N,2.384542,2.319102,-2.195304,2.332777,-2.231408,6.311417,0.098008,16.144325,9.823120,2.409415,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.582602,FBMQNRKSAWNXBT-UHFFFAOYSA-N
FBQWVNMHUBKEIP-UHFFFAOYSA-N,3.217506,2.312863,-2.288011,2.395932,-2.421042,7.226017,-0.128153,32.133549,9.870557,1.542162,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.517273,FBQWVNMHUBKEIP-UHFFFAOYSA-N
FBWGREMSYFZWFR-UHFFFAOYSA-N,3.296196,2.157133,-2.097612,2.227686,-2.394732,5.831871,0.154743,16.547379,10.102615,1.544765,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.412912,FBWGREMSYFZWFR-UHFFFAOYSA-N
FCBHDSLDRNWTJS-UHFFFAOYSA-N,3.360151,2.285790,-2.110480,2.364816,-2.178803,7.193979,0.052705,32.134766,10.133849,1.952735,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.420972,FCBHDSLDRNWTJS-UHFFFAOYSA-N
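
Since each descriptor table can be exported as a data frame indexed by molecule ID, several of them can be combined with plain `pandas`. The sketch below uses two small stand-in tables with made-up values and hypothetical column names. It only shows the alignment on a shared ID index and is not meant to describe how QSPRpred assembles its feature matrix internally.

```python
import pandas as pd

ids = pd.Index(["mol_0", "mol_1", "mol_2"], name="ID")

# Two stand-in descriptor tables with made-up values, both indexed by ID.
fingerprints = pd.DataFrame({"fp_0": [1, 0, 1], "fp_1": [0, 0, 1]}, index=ids)
# The second table lists the same molecules in reverse row order.
physchem = pd.DataFrame({"logp": [0.5, 2.1, -0.3]}, index=ids[::-1])

# Align on the shared ID index and place the columns side by side.
features = pd.concat([fingerprints, physchem], axis=1)
print(features)
```

Because the rows are matched by ID rather than by position, the differing row order of the two tables does not mix up values between molecules.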


## What's Next?

Now you know how data sets are represented in QSPRpred. Before you start modelling, you should also check out the [data preparation tutorial](data_preparation.ipynb) to learn how to prepare your data sets for modelling. This tutorial covers additional preparation steps such as feature filtering, selection and standardization through the `QSPRDataset.prepareDataset` method.