# Data Representation

In this tutorial, you will learn how data sets are represented in QSPRpred and how you can use the framework to store and prepare data sets, not only for QSPR modelling but for general cheminformatics tasks as well.

In [1]:
import pandas as pd

df = pd.read_csv("../../tutorial_data/A2A_LIGANDS.tsv", sep="\t")

df.head()

,SMILES,pchembl_value_Mean,Year
0,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,8.68,2008.0
1,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,4.82,2010.0
2,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0
3,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,5.45,2009.0
4,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,5.2,2019.0


### `MoleculeTable` and `QSPRDataset`

Let's take a look at the data structures you know from [the quick start](../../quick_start.ipynb) and how they are implemented. The `MoleculeTable` and `QSPRDataset` classes are specific to QSPR modelling tasks and implement a selection of interfaces for this purpose. Check out the entries for the `MoleculeDataSet` and `QSPRDataSet` abstract classes in the [API documentation](https://cddleiden.github.io/QSPRpred/docs/api/modules.html) to see what they offer. The main thing to remember for this tutorial, however, is that `MoleculeTable` adds the ability to add and store molecular descriptors, and `QSPRDataset` is its subclass, which adds the ability to store information about target properties and modelling tasks.

These two classes are actually initialized from `ChemStore` instances:

In [2]:
from qsprpred.data.chem.identifiers import InchiIdentifier
from qsprpred.data.chem.standardizers.papyrus import PapyrusStandardizer
from qsprpred.data.storage.tabular.basic_storage import TabularStorageBasic
import os

storage = TabularStorageBasic(
    name="RepresentationTutorialChemStore",
    path="../../tutorial_output/data",
    df=df,
    smiles_col="SMILES",
    standardizer=PapyrusStandardizer(),  # standardizes the SMILES strings
    identifier=InchiIdentifier(),  # generates custom identifiers
    n_jobs=os.cpu_count()  # use all available CPUs
)
storage

TabularStorageBasic (4082)

In [3]:
from qsprpred.data import MoleculeTable

mt = MoleculeTable(
    storage,  # ChemStore object 
    name="RepresentationTutorialMoleculeTable",
    path="../../tutorial_output/data",
    # where the molecule table associated data will live
)
mt.getDF()

ID,SMILES,pchembl_value_Mean,Year,original_smiles,ID,ID_before_change
AACWUFIIMOHGSO-UHFFFAOYSA-N,Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n...,8.68,2008.0,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,AACWUFIIMOHGSO-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_0000
AAEYTMMNWWKSKZ-UHFFFAOYSA-N,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2nc3c(cc12...,4.82,2010.0,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,AAEYTMMNWWKSKZ-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_0001
AAGFKZWKWAMJNP-UHFFFAOYSA-N,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,AAGFKZWKWAMJNP-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_0002
AANUKDYJZPKTKN-UHFFFAOYSA-N,CNC(=O)C12CC1C(n1cnc3c(NCc4cccc(Cl)c4)nc(C#CCC...,5.45,2009.0,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,AANUKDYJZPKTKN-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_0003
AASXHCGIIQCKEE-UHFFFAOYSA-N,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,5.20,2019.0,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,AASXHCGIIQCKEE-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_0004
...,...,...,...,...,...,...
ZYXGKENMDDPQIE-UHFFFAOYSA-N,CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12,7.09,2018.0,CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12,ZYXGKENMDDPQIE-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_4077
ZYZWFDVXMLCIOU-UHFFFAOYSA-N,Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1,8.22,2008.0,Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1,ZYZWFDVXMLCIOU-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_4078
ZZBZWSYDXUPJCT-UHFFFAOYSA-N,Nc1nc(CSc2nnc(N)s2)nc(Nc2ccc(F)cc2)n1,4.89,2010.0,Nc1nc(Nc2ccc(F)cc2)nc(CSc2nnc(N)s2)n1,ZZBZWSYDXUPJCT-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_4079
ZZMIPZLRKFEGIA-UHFFFAOYSA-N,CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1,6.51,2013.0,CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1,ZZMIPZLRKFEGIA-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_4080


`ChemStore` is basically a wrapper around a database or folder structure containing molecules, and it supports some operations on the molecules themselves and their properties. You can see more in [this advanced tutorial](../advanced/data_representation.ipynb). Because the molecule table is backed by a `ChemStore`, we can perform various operations on the molecule table we just created:

In [4]:
for mol in mt:
    print(mol)
    print(mol.as_rd_mol())
    print(mol.smiles)
    print(mol.props)
    print(mol.representations)
    break

TabularMol(AACWUFIIMOHGSO-UHFFFAOYSA-N, Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)n1)
<rdkit.Chem.rdchem.Mol object at 0x7f38e7109ee0>
Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)n1
{'pchembl_value_Mean': 8.68, 'ID_before_change': 'RepresentationTutorialChemStore_library_0000', 'ID': 'AACWUFIIMOHGSO-UHFFFAOYSA-N', 'SMILES': 'Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)n1', 'Year': 2008.0, 'original_smiles': 'Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(C)c1'}
None


Note that `ChemStore` objects are subscriptable, and so are `MoleculeTable` objects:

In [5]:
mt['AACWUFIIMOHGSO-UHFFFAOYSA-N'].props

{'SMILES': 'Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)n1',
 'pchembl_value_Mean': 8.68,
 'Year': 2008.0,
 'original_smiles': 'Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(C)c1',
 'ID': 'AACWUFIIMOHGSO-UHFFFAOYSA-N',
 'ID_before_change': 'RepresentationTutorialChemStore_library_0000'}
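
The iteration and subscript access shown above follow Python's standard container protocols. As a rough, library-independent sketch (the `ToyMol` and `ToyStore` names are hypothetical and not part of QSPRpred), a store keyed by molecule ID could support both like this:

```python
from dataclasses import dataclass, field


@dataclass
class ToyMol:
    """Hypothetical stand-in for a stored molecule (not a QSPRpred class)."""

    id: str
    smiles: str
    props: dict = field(default_factory=dict)


class ToyStore:
    """Hypothetical container that is both iterable and subscriptable by ID."""

    def __init__(self, mols):
        # keep molecules in a dict so that lookup by ID is cheap
        self._mols = {mol.id: mol for mol in mols}

    def __len__(self):
        return len(self._mols)

    def __iter__(self):
        # iterating yields molecule objects, like `for mol in mt` above
        return iter(self._mols.values())

    def __getitem__(self, mol_id):
        # subscripting by identifier, like `mt["..."]` above
        return self._mols[mol_id]


toy_store = ToyStore([
    ToyMol(
        "AAGFKZWKWAMJNP-UHFFFAOYSA-N",
        "O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1",
        {"pchembl_value_Mean": 5.65, "Year": 2009.0},
    ),
    ToyMol(
        "ZYZWFDVXMLCIOU-UHFFFAOYSA-N",
        "Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1",
        {"pchembl_value_Mean": 8.22, "Year": 2008.0},
    ),
])

first_mol = next(iter(toy_store))  # same idea as the `break` in the loop above
looked_up = toy_store["ZYZWFDVXMLCIOU-UHFFFAOYSA-N"]
```

Keeping the molecules in a dictionary is only one possible design; a storage backed by a database would typically translate the same two methods into queries.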

`QSPRDataset` is a subclass of `MoleculeTable`, which requires target properties to be defined in addition to the underlying `ChemStore` object:

In [6]:
from qsprpred import TargetTasks, TargetProperty

from qsprpred.data import QSPRDataset

dataset = QSPRDataset(
    storage,  # ChemStore object
    name="RepresentationTutorialDataset",
    path="../../tutorial_output/data",
    target_props=[TargetProperty("pchembl_value_Mean", TargetTasks.REGRESSION)]
)
dataset.targetProperties

[TargetProperty(name=pchembl_value_Mean, task=REGRESSION)]

But you can also create it by converting from a `MoleculeTable` object:

In [7]:
dataset = QSPRDataset.fromMolTable(mt, target_props=[
    TargetProperty("pchembl_value_Mean", TargetTasks.REGRESSION)
])
dataset.targetProperties

[TargetProperty(name=pchembl_value_Mean, task=REGRESSION)]

And you can also go directly from a data frame, which will create the underlying `ChemStore` object for you:

In [8]:
dataset = QSPRDataset.fromDF(
    name="RepresentationTutorialDataset",
    df=df,
    path="../../tutorial_output/data",
    target_props=[TargetProperty("pchembl_value_Mean", TargetTasks.REGRESSION)],
    smiles_col="SMILES"
)
dataset.targetProperties

[TargetProperty(name=pchembl_value_Mean, task=REGRESSION)]

In [9]:
dataset.storage

TabularStorageBasic (4082)

### Saving and Loading

The data structures in QSPRpred are designed to be easily saved to and reloaded from files to persist changes. We can save the data set like this:

In [10]:
dataset.save()

This will save the data set into a folder we specified upon creation:

In [11]:
dataset.path

'/home/sichom/projects/QSPRpred/tutorials/tutorial_output/data/RepresentationTutorialDataset'

It will also update or save the underlying `ChemStore` object, which lives in its own folder next to the data set:

In [12]:
dataset.storage.path

'/home/sichom/projects/QSPRpred/tutorials/tutorial_output/data/RepresentationTutorialDataset_storage'

Therefore, storages and data sets can live in different folders and can be shared between projects. That means you can use the same storage for both your QSPR modelling and your docking project, for example. Both projects will have access to all data in your storage even if it changes over time, which can be useful for data management. 
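
The idea of one storage serving several projects can be illustrated with plain Python objects. In this sketch, `ToyStorage` and `ToyProject` are hypothetical names that only mimic the relationship between a storage and the data sets built on it; they are not QSPRpred classes. Because both projects hold a reference to the same storage rather than a copy, a molecule added later is visible to both:

```python
class ToyStorage:
    """Hypothetical molecule storage: maps identifiers to SMILES strings."""

    def __init__(self):
        self.smiles_by_id = {}

    def add(self, mol_id, smiles):
        self.smiles_by_id[mol_id] = smiles


class ToyProject:
    """Hypothetical data set that refers to a storage instead of copying it."""

    def __init__(self, name, storage):
        self.name = name
        self.storage = storage  # shared reference, not a copy

    def __len__(self):
        return len(self.storage.smiles_by_id)


shared = ToyStorage()
shared.add(
    "ZYZWFDVXMLCIOU-UHFFFAOYSA-N",
    "Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1",
)

qspr_project = ToyProject("qspr", shared)
docking_project = ToyProject("docking", shared)

# a molecule added after both projects were created...
shared.add(
    "ZZMIPZLRKFEGIA-UHFFFAOYSA-N",
    "CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1",
)
# ...is visible from either project
sizes = (len(qspr_project), len(docking_project))
```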

Reloading the data set is easy as well. Every `PropertyStorage` gets a `fromFile` method that can be used to reload the instance from a saved snapshot:

In [13]:
dataset = QSPRDataset.fromFile(
    f"{dataset.path}/meta.json"
)
dataset.targetProperties

[TargetProperty(name=pchembl_value_Mean, task=REGRESSION)]
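
The save and reload pattern above boils down to writing an object's state to a metadata file and rebuilding the object from it. The following is only a conceptual sketch using the standard library; `ToyDataset` and the layout of its `meta.json` are assumptions made for illustration and do not reflect how QSPRpred actually serializes data sets:

```python
import json
import os
import tempfile


class ToyDataset:
    """Hypothetical data set that persists its state to `<path>/meta.json`."""

    def __init__(self, name, path, target_props):
        self.name = name
        self.path = os.path.join(path, name)  # each data set gets its own folder
        self.target_props = list(target_props)

    def save(self):
        """Write the state to disk and return the path of the metadata file."""
        os.makedirs(self.path, exist_ok=True)
        meta_file = os.path.join(self.path, "meta.json")
        with open(meta_file, "w", encoding="utf-8") as handle:
            json.dump(
                {"name": self.name, "target_props": self.target_props}, handle
            )
        return meta_file

    @classmethod
    def from_file(cls, meta_file):
        """Rebuild an instance from a previously saved metadata file."""
        with open(meta_file, encoding="utf-8") as handle:
            state = json.load(handle)
        # the parent folder sits two levels above the metadata file
        parent = os.path.dirname(os.path.dirname(meta_file))
        return cls(state["name"], parent, state["target_props"])


with tempfile.TemporaryDirectory() as tmp_dir:
    original = ToyDataset("ToyDataset", tmp_dir, ["pchembl_value_Mean"])
    reloaded = ToyDataset.from_file(original.save())
    same_state = (reloaded.name, reloaded.path, reloaded.target_props) == (
        original.name,
        original.path,
        original.target_props,
    )
```

The point of the sketch is that everything needed to rebuild the object must end up in the saved file, which is also why custom components have to expose their settings in a serializable form.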

### Intermezzo on Molecule Standardization

Before doing any calculations, it is a good idea to standardize structures and drop invalid molecules, which is handled by the storage object itself. However, you can always override the standardizer associated with a `ChemStore` object or a `MoleculeTable` object. We can even write our own standardizer and use it to standardize the molecules before we do any calculations:

In [14]:
len(dataset)  # original length

4082

In [15]:
from qsprpred.data.chem.standardizers.base import ChemStandardizer


class MyStandardizer(ChemStandardizer):
    """A silly example standardizer that removes all molecules with halogens in them."""

    def __init__(self, halogens=None):
        self.halogens = halogens or ["F", "Cl", "Br", "I"]

    def convertSMILES(self, smiles) -> str | None:
        """Discards all molecules with halogens in them.
        
        Returns:
            str | None: 
                The standardized SMILES string or None if the molecule should be discarded.
        """
        for halogen in self.halogens:
            if halogen in smiles:
                return None  # return None to discard
        return smiles

    @property
    def settings(self):
        """Used to return the settings of the standardizer."""
        return {"halogens": self.halogens}

    def getID(self):
        return ",".join(sorted(self.halogens))

    @classmethod
    def fromSettings(cls, settings: dict):
        return cls(**settings)


dataset.applyStandardizer(
    MyStandardizer(["Br", "F", "I"])
)  # remove all molecules with bromine, fluorine or iodine in them
len(dataset)  # reduced length



3286

You can see that you are required to implement more than just the `convertSMILES` method. This is because standardizers should be explicit about their settings and it should be possible to compare them. This will help you find out if two storages or data sets are compatible with each other or if you need to unify the standardization process between them:

In [16]:
dataset.storage.standardizer.getID()

'Br,F,I'

The standardizers used are saved with the storage so you can always retrieve them and check how the data was standardized:

In [17]:
dataset.save()
dataset = QSPRDataset.fromFile(
    f"{dataset.path}/meta.json"
)
dataset.storage.standardizer.settings

{'halogens': ['Br', 'F', 'I']}

In [18]:
dataset.storage.standardizer.getID()

'Br,F,I'
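
Why `settings`, `getID` and `fromSettings` matter can be seen without QSPRpred at all. The sketch below uses a standalone class, `HalogenFilter`, which is a hypothetical copy of the logic in `MyStandardizer` without the `ChemStandardizer` base class. It shows that the settings survive a JSON round trip and that the ID does not depend on the order in which the halogens were given. How QSPRpred itself persists standardizers is not shown here.

```python
import json


class HalogenFilter:
    """Hypothetical, library-free copy of the MyStandardizer logic."""

    def __init__(self, halogens=None):
        self.halogens = halogens or ["F", "Cl", "Br", "I"]

    def convertSMILES(self, smiles):
        # naive substring check, exactly as in the example above
        if any(halogen in smiles for halogen in self.halogens):
            return None
        return smiles

    @property
    def settings(self):
        return {"halogens": self.halogens}

    def getID(self):
        # sorting makes the ID independent of the input order
        return ",".join(sorted(self.halogens))

    @classmethod
    def fromSettings(cls, settings):
        return cls(**settings)


filter_a = HalogenFilter(["Br", "F", "I"])
# simulate saving the settings to disk and loading them again
restored = HalogenFilter.fromSettings(json.loads(json.dumps(filter_a.settings)))
# same halogens, different order
filter_b = HalogenFilter(["I", "Br", "F"])

compatible = filter_a.getID() == restored.getID() == filter_b.getID()
kept = filter_a.convertSMILES("Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1")
dropped = filter_a.convertSMILES("Nc1nc(CSc2nnc(N)s2)nc(Nc2ccc(F)cc2)n1")
```

Comparing IDs rather than object identity is what makes it possible to tell that two independently created storages were standardized the same way.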

QSPRpred offers a few standardizers in the `qsprpred.data.chem.standardizers` package, so feel free to look at the documentation of this package.

### Calculating Molecular Descriptors

Once you have settled on your preferred data structure and standardized your data set, you can start calculating descriptors. The package already contains many descriptor implementations, but you can also easily add your own. We encourage you to check out the [descriptor tutorial](descriptors.ipynb) to learn more, but for the sake of completeness here is a simple example with Morgan fingerprints and RDKit descriptors:

In [19]:
from qsprpred.data.descriptors.fingerprints import MorganFP
from qsprpred.data.descriptors.sets import RDKitDescs

dataset.addDescriptors([MorganFP(radius=3, nBits=2048), RDKitDescs()])

Notice that since we are using `TabularStorageBasic` as the `ChemStore` for the data set, we can also speed these calculations up with parallelization:

In [20]:
dataset.nJobs = os.cpu_count()

In [21]:
dataset.addDescriptors([MorganFP(radius=3, nBits=2048), RDKitDescs()], recalculate=True)

**Note:** More details on parallelization through storages can be found in the [advanced tutorials](../../advanced/data/parallelization.ipynb).

Descriptors are kept in their own wrapped tables, which can be accessed with the `descriptors` attribute:

In [22]:
dataset.descriptors

[DescriptorTable (3286), DescriptorTable (3286)]

For your convenience, these are simply specialized implementations of `PandasDataTable` objects, so you can use all the methods and attributes of the `PropertyStorage` API on them as well:

In [23]:
dataset.descriptors[1].getDF()

ID,RDkit_AvgIpc,RDkit_BCUT2D_CHGHI,RDkit_BCUT2D_CHGLO,RDkit_BCUT2D_LOGPHI,RDkit_BCUT2D_LOGPLOW,RDkit_BCUT2D_MRHI,RDkit_BCUT2D_MRLOW,RDkit_BCUT2D_MWHI,RDkit_BCUT2D_MWLOW,RDkit_BalabanJ,...,RDkit_fr_sulfone,RDkit_fr_term_acetylene,RDkit_fr_tetrazole,RDkit_fr_thiazole,RDkit_fr_thiocyan,RDkit_fr_thiophene,RDkit_fr_unbrch_alkane,RDkit_fr_urea,RDkit_qed,ID
AUBIOACYLHNVOM-UHFFFAOYSA-N,3.143765,2.147943,-2.064522,2.152083,-2.260265,5.904432,0.093952,16.462145,10.272797,1.720541,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.684213,AUBIOACYLHNVOM-UHFFFAOYSA-N
AUNXSAUXRUCPOQ-UHFFFAOYSA-N,3.236408,2.148100,-2.038914,2.327026,-1.967526,7.208427,0.414717,32.133545,10.172657,1.857624,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.619291,AUNXSAUXRUCPOQ-UHFFFAOYSA-N
AUORTGQKWNTDKT-UHFFFAOYSA-N,2.651005,2.149889,-2.122324,2.239270,-2.313572,5.967520,0.091172,16.474794,10.100842,1.803300,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.665445,AUORTGQKWNTDKT-UHFFFAOYSA-N
AUQHNAQGJBULMA-UHFFFAOYSA-N,2.879929,2.106531,-1.952251,2.285934,-2.070887,7.183883,1.017097,32.134708,10.291913,2.419333,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.702394,AUQHNAQGJBULMA-UHFFFAOYSA-N
AVCVCCQJIMUKOJ-UHFFFAOYSA-N,3.578061,2.416188,-2.321011,2.341634,-2.495074,6.182725,0.226536,16.486029,10.163918,1.344940,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.464362,AVCVCCQJIMUKOJ-UHFFFAOYSA-N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZXFPGFBEAAFTOL-UHFFFAOYSA-N,2.673123,2.171015,-2.111679,2.324862,-2.186289,8.143559,0.264614,32.166599,9.991602,2.385960,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.861072,ZXFPGFBEAAFTOL-UHFFFAOYSA-N
ZXNLEZIGIRYUMA-UHFFFAOYSA-N,2.973333,2.159815,-2.063629,2.380729,-2.191988,5.708237,0.367977,16.327944,10.116677,1.951521,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.398507,ZXNLEZIGIRYUMA-UHFFFAOYSA-N
ZXPDGTGMZKIESV-UHFFFAOYSA-N,2.962669,2.748903,-2.232095,2.673987,-2.410849,6.283495,-0.131544,35.495701,9.981504,1.537104,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.554752,ZXPDGTGMZKIESV-UHFFFAOYSA-N
ZXUDFYXHMFXRAZ-UHFFFAOYSA-N,3.012530,2.139712,-2.086109,2.313124,-1.989427,6.301805,0.579939,35.495693,9.997193,2.224349,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.771410,ZXUDFYXHMFXRAZ-UHFFFAOYSA-N


## What's Next?

Now you know how data sets are represented in QSPRpred. Before you start modelling, you should also check out the [data preparation tutorial](data_preparation.ipynb) to learn how to prepare your data sets for modelling. That tutorial covers additional preparation steps such as feature filtering, selection and standardization through the `QSPRDataset.prepareDataset` method.