# Data Representation

In this tutorial, you will learn how data sets are represented in QSPRpred and how you can use the framework to store and prepare data sets, not only for QSPR modelling but also for general cheminformatics tasks.

In [1]:
import pandas as pd

df = pd.read_csv("../../tutorial_data/A2A_LIGANDS.tsv", sep="\t")

df.head()

,SMILES,pchembl_value_Mean,Year
0,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,8.68,2008.0
1,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,4.82,2010.0
2,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0
3,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,5.45,2009.0
4,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,5.2,2019.0


### `MoleculeTable` and `QSPRTable`

Let's take a look at the data structures you know from [the quick start](../../quick_start.ipynb) and how they are implemented. The `MoleculeTable` and `QSPRTable` classes are specific to QSPR modelling tasks and implement a selection of interfaces for this purpose. Check out the entries for the `MoleculeDataSet` and `QSPRDataSet` abstract classes in the [API documentation](https://cddleiden.github.io/QSPRpred/docs/api/modules.html) to see what they offer. The main thing to remember for this tutorial, however, is that `MoleculeTable` adds the ability to calculate and store molecular descriptors, and `QSPRTable` is its subclass, which adds the ability to store information about target properties and modelling tasks.

In order to initialize these with their respective constructors, you will need a `ChemStore` object:

In [2]:
from qsprpred.data.chem.identifiers import InchiIdentifier
from qsprpred.data.chem.standardizers.papyrus import PapyrusStandardizer
from qsprpred.data.storage.tabular.basic_storage import PandasChemStore
import os

storage = PandasChemStore(
    name="RepresentationTutorialChemStore",
    path="../../tutorial_output/data",
    df=df,
    smiles_col="SMILES",
    standardizer=PapyrusStandardizer(),  # standardizes the SMILES strings
    identifier=InchiIdentifier(),  # generates custom identifiers
    n_jobs=os.cpu_count()  # use all available CPUs
)
storage

PandasChemStore (4082)

You can read more about the `ChemStore` objects in the [advanced tutorials](../../advanced/data/data_representation.ipynb). In short, they are simply wrappers around a database or a folder structure containing molecules, their properties and other metadata. They support various operations that `MoleculeTable` and `QSPRTable` objects also take advantage of.

In [3]:
from qsprpred.data import MoleculeTable

mt = MoleculeTable(
    storage,  # ChemStore object 
    name="RepresentationTutorialMoleculeTable",
    path="../../tutorial_output/data",
    # determine where the molecule table associated data will live
)
mt.getDF()

ID,SMILES,pchembl_value_Mean,Year,original_smiles,ID,ID_before_change
AACWUFIIMOHGSO-UHFFFAOYSA-N,Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n...,8.68,2008.0,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,AACWUFIIMOHGSO-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_0000
AAEYTMMNWWKSKZ-UHFFFAOYSA-N,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2nc3c(cc12...,4.82,2010.0,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,AAEYTMMNWWKSKZ-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_0001
AAGFKZWKWAMJNP-UHFFFAOYSA-N,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,AAGFKZWKWAMJNP-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_0002
AANUKDYJZPKTKN-UHFFFAOYSA-N,CNC(=O)C12CC1C(n1cnc3c(NCc4cccc(Cl)c4)nc(C#CCC...,5.45,2009.0,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,AANUKDYJZPKTKN-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_0003
AASXHCGIIQCKEE-UHFFFAOYSA-N,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,5.20,2019.0,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,AASXHCGIIQCKEE-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_0004
...,...,...,...,...,...,...
ZYXGKENMDDPQIE-UHFFFAOYSA-N,CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12,7.09,2018.0,CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12,ZYXGKENMDDPQIE-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_4077
ZYZWFDVXMLCIOU-UHFFFAOYSA-N,Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1,8.22,2008.0,Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1,ZYZWFDVXMLCIOU-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_4078
ZZBZWSYDXUPJCT-UHFFFAOYSA-N,Nc1nc(CSc2nnc(N)s2)nc(Nc2ccc(F)cc2)n1,4.89,2010.0,Nc1nc(Nc2ccc(F)cc2)nc(CSc2nnc(N)s2)n1,ZZBZWSYDXUPJCT-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_4079
ZZMIPZLRKFEGIA-UHFFFAOYSA-N,CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1,6.51,2013.0,CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1,ZZMIPZLRKFEGIA-UHFFFAOYSA-N,RepresentationTutorialChemStore_library_4080


One feature that `ChemStore` and `MoleculeTable` share is the ability to iterate over molecules in a convenient way:

In [4]:
for mol in mt:
    print(mol)
    print(mol.as_rd_mol())
    print(mol.smiles)
    print(mol.props)
    print(mol.representations)
    break

TabularMol (AACWUFIIMOHGSO-UHFFFAOYSA-N, Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)n1)
<rdkit.Chem.rdchem.Mol object at 0x7fef6c54cba0>
Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)n1
{'Year': 2008.0, 'ID_before_change': 'RepresentationTutorialChemStore_library_0000', 'SMILES': 'Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)n1', 'pchembl_value_Mean': 8.68, 'ID': 'AACWUFIIMOHGSO-UHFFFAOYSA-N', 'original_smiles': 'Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(C)c1'}
None


Both `ChemStore` and `MoleculeTable` objects are also subscriptable, so you can look up a molecule by its identifier:

In [5]:
mt['AACWUFIIMOHGSO-UHFFFAOYSA-N'].props

{'SMILES': 'Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)n1',
 'pchembl_value_Mean': 8.68,
 'Year': 2008.0,
 'original_smiles': 'Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(C)c1',
 'ID': 'AACWUFIIMOHGSO-UHFFFAOYSA-N',
 'ID_before_change': 'RepresentationTutorialChemStore_library_0000'}

There are many more convenience features like this, and we recommend that you check out the [API documentation](https://cddleiden.github.io/QSPRpred/docs/api/modules.html) or the [advanced tutorials](../../advanced/data/data_representation.ipynb) to learn more about them.
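The iteration and subscript access shown above rest on Python's container protocol. Here is a minimal sketch of that idea with hypothetical `ToyMol` and `ToyStore` classes. It is not how QSPRpred implements `ChemStore` or `MoleculeTable`; the class names, fields and methods are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMol:
    """Hypothetical stand-in for a stored molecule (not a QSPRpred class)."""
    id: str
    smiles: str
    props: dict = field(default_factory=dict)


class ToyStore:
    """Hypothetical container keyed by molecule ID (not a QSPRpred class)."""

    def __init__(self, mols):
        # keep molecules in a dict so that lookup by ID is cheap
        self._mols = {mol.id: mol for mol in mols}

    def __len__(self):
        return len(self._mols)

    def __iter__(self):
        # iterating yields molecule objects in insertion order
        return iter(self._mols.values())

    def __getitem__(self, mol_id):
        # subscripting by ID returns the matching molecule
        return self._mols[mol_id]


store = ToyStore([
    ToyMol("AAGFKZWKWAMJNP-UHFFFAOYSA-N",
           "O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1",
           {"pchembl_value_Mean": 5.65, "Year": 2009.0}),
    ToyMol("ZYZWFDVXMLCIOU-UHFFFAOYSA-N",
           "Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1",
           {"pchembl_value_Mean": 8.22, "Year": 2008.0}),
])

for mol in store:
    print(mol.id, mol.props["pchembl_value_Mean"])

print(store["ZYZWFDVXMLCIOU-UHFFFAOYSA-N"].smiles)
```

Keying the container by identifier is what makes unique, reproducible identifiers so useful: the same key always leads to the same molecule.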

As already mentioned, `QSPRTable` is a subclass of `MoleculeTable`, which requires target properties to be defined in addition to the underlying `ChemStore` object:

In [6]:
from qsprpred import TargetTasks, TargetProperty

from qsprpred.data import QSPRTable

dataset = QSPRTable(
    storage,  # ChemStore object
    name="RepresentationTutorialDataset",
    path="../../tutorial_output/data",
    target_props=[TargetProperty("pchembl_value_Mean", TargetTasks.REGRESSION)]
)
dataset.targetProperties

[TargetProperty(name=pchembl_value_Mean, task=REGRESSION)]

You can also create it directly from an existing `MoleculeTable` object:

In [7]:
dataset = QSPRTable.fromMolTable(mt, target_props=[
    TargetProperty("pchembl_value_Mean", TargetTasks.REGRESSION)
])
dataset.targetProperties

[TargetProperty(name=pchembl_value_Mean, task=REGRESSION)]
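To make the relationship between the two classes more concrete, here is a rough sketch of a table class and a subclass that additionally requires target properties. All names (`ToyTask`, `ToyTarget`, `ToyMolTable`, `ToyQSPRTable`, `from_mol_table`) are hypothetical, and the check that every row carries the target value is an assumption of this sketch, not a statement about what QSPRpred does internally.

```python
from dataclasses import dataclass
from enum import Enum


class ToyTask(Enum):
    """Hypothetical task types (made up for this sketch)."""
    REGRESSION = "REGRESSION"
    CLASSIFICATION = "CLASSIFICATION"


@dataclass(frozen=True)
class ToyTarget:
    """Hypothetical target property: a column name plus a task."""
    name: str
    task: ToyTask


class ToyMolTable:
    """Hypothetical table of molecules; rows map IDs to property dicts."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows


class ToyQSPRTable(ToyMolTable):
    """Hypothetical subclass that additionally requires target properties."""

    def __init__(self, name, rows, target_props):
        super().__init__(name, rows)
        # assumption of this sketch: every target must be present in every row
        for target in target_props:
            missing = [i for i, props in rows.items() if target.name not in props]
            if missing:
                raise ValueError(f"{target.name} missing for {missing}")
        self.target_props = list(target_props)

    @classmethod
    def from_mol_table(cls, table, target_props):
        # reuse the rows of the existing table instead of copying the data
        return cls(table.name, table.rows, target_props)


rows = {
    "AAGFKZWKWAMJNP-UHFFFAOYSA-N": {"pchembl_value_Mean": 5.65, "Year": 2009.0},
    "ZYZWFDVXMLCIOU-UHFFFAOYSA-N": {"pchembl_value_Mean": 8.22, "Year": 2008.0},
}
mol_table = ToyMolTable("toy", rows)
target = ToyTarget("pchembl_value_Mean", ToyTask.REGRESSION)
qspr_table = ToyQSPRTable.from_mol_table(mol_table, [target])

print(isinstance(qspr_table, ToyMolTable))
print(qspr_table.target_props)
```

Because the subclass only adds information, anything that works on the parent table keeps working on the converted object.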

If you already have a data frame prepared, you can also create the data set from it directly with the `fromDF` method:

In [8]:
dataset = QSPRTable.fromDF(
    name="RepresentationTutorialDataset",
    df=df,
    path="../../tutorial_output/data",
    target_props=[TargetProperty("pchembl_value_Mean", TargetTasks.REGRESSION)],
    smiles_col="SMILES"
)
dataset.targetProperties

[TargetProperty(name=pchembl_value_Mean, task=REGRESSION)]

This will create the `ChemStore` object for you:

In [9]:
dataset.storage

PandasChemStore (4082)

This has some drawbacks, however. Unlike the example above, where we initialized the storage ourselves, `fromDF` does not standardize the structures or generate unique identifiers for you. It assumes that you have already done this work and that you will prepare compounds in the same way before using the resulting models for prediction. You can find out more about standardization in the [advanced tutorial](../../advanced/data/data_representation.ipynb).
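If you do go the data frame route, a quick sanity check beforehand can catch the most obvious problems. The sketch below uses plain pandas on a small toy frame with made-up activity values; the clean-up policy (drop rows without a structure, average the activity of exact duplicates) is just one possible choice and an assumption of this example.

```python
import pandas as pd

# Toy frame with made-up activities: two rows share a SMILES string, one has none.
toy_df = pd.DataFrame({
    "SMILES": [
        "Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1",
        "Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1",
        None,
        "CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1",
    ],
    "pchembl_value_Mean": [8.22, 8.10, 6.00, 6.51],
})

n_missing = int(toy_df["SMILES"].isna().sum())
# keep=False flags every member of a duplicated group, not only the repeats
dupes = toy_df[
    toy_df.duplicated(subset="SMILES", keep=False) & toy_df["SMILES"].notna()
]

print(f"rows without SMILES: {n_missing}")
print(f"rows sharing a SMILES string: {len(dupes)}")

# one simple policy: drop missing structures, average the activity of duplicates
clean = (
    toy_df.dropna(subset=["SMILES"])
    .groupby("SMILES", as_index=False)["pchembl_value_Mean"]
    .mean()
)
print(clean)
```

Keep in mind that this only compares strings. The same molecule can be written as different SMILES strings, as the `original_smiles` and `SMILES` columns earlier in this tutorial show, so a check like this does not replace proper standardization.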

### Saving and Loading

The data structures in QSPRpred are designed to be saved to files and reloaded from them, so that changes persist. You can save the data set like this:

In [10]:
dataset.save()

This saves the data set into a subfolder, named after the data set, of the path we specified upon creation:

In [11]:
dataset.path

'/home/sichom/projects/QSPRpred/tutorials/tutorial_output/data/RepresentationTutorialDataset'

It will also save or update the underlying `ChemStore` object, which in this case lives in its own folder under the same parent directory:

In [12]:
dataset.storage.path

'/home/sichom/projects/QSPRpred/tutorials/tutorial_output/data/RepresentationTutorialDataset_storage'

Because a storage is kept in its own folder, storages and data sets can live in different locations and can be shared between projects. That means you can use the same storage for both your QSPR modelling and your docking project, for example. Both projects will have access to all data in your storage even if it changes over time, which can be useful for data management.

Reloading the data set is just as easy. Each class provides a `fromFile` method that reloads an instance from a saved snapshot:

In [13]:
dataset = QSPRTable.fromFile(
    f"{dataset.path}/meta.json"
)
dataset.targetProperties

[TargetProperty(name=pchembl_value_Mean, task=REGRESSION)]
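The general pattern behind this (a metadata file that describes the data set and points to a separately stored storage) can be sketched with the standard library alone. The `ToyTable` class, its `meta.json` fields and the folder names below are hypothetical and do not reflect the actual file layout QSPRpred writes to disk.

```python
import json
import tempfile
from pathlib import Path


class ToyTable:
    """Hypothetical data set that persists itself as <path>/<name>/meta.json."""

    def __init__(self, name, path, storage_path, targets):
        self.name = name
        self.path = Path(path) / name            # folder owned by this data set
        self.storage_path = Path(storage_path)   # folder owned by the storage
        self.targets = list(targets)

    def save(self):
        self.path.mkdir(parents=True, exist_ok=True)
        meta = {
            "name": self.name,
            "storage_path": str(self.storage_path),
            "targets": self.targets,
        }
        meta_file = self.path / "meta.json"
        meta_file.write_text(json.dumps(meta, indent=2))
        return meta_file

    @classmethod
    def from_file(cls, meta_file):
        meta_file = Path(meta_file)
        meta = json.loads(meta_file.read_text())
        # meta.json sits in <path>/<name>, so <path> is two levels up
        return cls(meta["name"], meta_file.parent.parent,
                   meta["storage_path"], meta["targets"])


with tempfile.TemporaryDirectory() as root:
    shared_storage = Path(root) / "shared_storage"
    qspr = ToyTable("qspr_project", root, shared_storage, ["pchembl_value_Mean"])
    docking = ToyTable("docking_project", root, shared_storage, [])
    qspr.save()
    docking.save()

    reloaded = ToyTable.from_file(qspr.path / "meta.json")
    print(reloaded.targets)
    print(reloaded.storage_path == docking.storage_path)
```

Since each data set only stores a reference to the storage, two projects can point at the same molecules without duplicating them.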

### Calculating Molecular Descriptors

Once you have settled on your preferred data structure and standardized your data set, you can start calculating descriptors. The package already contains many descriptor implementations, but you can also easily add your own. We encourage you to check out the [descriptor tutorial](descriptors.ipynb) to learn more, but for the sake of completeness, here is a simple example with Morgan fingerprints and RDKit descriptors:

In [14]:
from qsprpred.data.descriptors.fingerprints import MorganFP
from qsprpred.data.descriptors.sets import RDKitDescs

dataset.addDescriptors([MorganFP(radius=3, nBits=2048), RDKitDescs()])

Note that since the data set uses a `PandasChemStore` as its `ChemStore`, we can also speed up these calculations with parallelization:

In [15]:
dataset.nJobs = os.cpu_count()

In [16]:
dataset.addDescriptors([MorganFP(radius=3, nBits=2048), RDKitDescs()], recalculate=True)

**Note:** More details on parallelization through storages can be found in the [advanced tutorials](../../advanced/data/parallelization.ipynb).
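As a rough illustration of what parallelizing a descriptor calculation involves, the sketch below splits a list of SMILES strings into chunks, processes the chunks concurrently and stitches the results back together in the original order. Everything here is hypothetical: `toy_descriptor` merely counts characters and is not a real descriptor, and threads are used only to keep the example simple (CPU-bound work would typically use processes). It says nothing about how QSPRpred schedules its own jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_descriptor(smiles):
    """Hypothetical descriptor: counts of a few characters in the SMILES string."""
    return [smiles.count("N"), smiles.count("O"), smiles.count("=")]


def chunked(items, n_chunks):
    """Split a list into at most n_chunks contiguous, similarly sized parts."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def calculate(smiles_list, n_jobs=2):
    """Compute toy descriptors chunk by chunk, keeping the input order."""
    chunks = chunked(smiles_list, n_jobs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map preserves chunk order, so rows stay aligned with the input
        results = pool.map(
            lambda chunk: [toy_descriptor(s) for s in chunk], chunks
        )
        return [row for chunk in results for row in chunk]


smiles = [
    "O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1",
    "CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12",
    "Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1",
    "Nc1nc(CSc2nnc(N)s2)nc(Nc2ccc(F)cc2)n1",
    "CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1",
]

parallel = calculate(smiles, n_jobs=2)
serial = [toy_descriptor(s) for s in smiles]
print(parallel == serial)
```

The important property is that the parallel result matches the serial one row for row, so descriptors stay aligned with their molecules.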

Descriptors are kept in their own wrapped tables, which can be accessed with the `descriptors` attribute:

In [17]:
dataset.descriptors

[DescriptorTable (4082), DescriptorTable (4082)]

Conveniently, these are simply specialized implementations of `PandasDataTable`, so you can use all of its methods and attributes on them as well:

In [18]:
dataset.descriptors[1].getDF()

ID,RDkit_AvgIpc,RDkit_BCUT2D_CHGHI,RDkit_BCUT2D_CHGLO,RDkit_BCUT2D_LOGPHI,RDkit_BCUT2D_LOGPLOW,RDkit_BCUT2D_MRHI,RDkit_BCUT2D_MRLOW,RDkit_BCUT2D_MWHI,RDkit_BCUT2D_MWLOW,RDkit_BalabanJ,...,RDkit_fr_sulfone,RDkit_fr_term_acetylene,RDkit_fr_tetrazole,RDkit_fr_thiazole,RDkit_fr_thiocyan,RDkit_fr_thiophene,RDkit_fr_unbrch_alkane,RDkit_fr_urea,RDkit_qed,ID
CMCQWYYGEHUAAM-UHFFFAOYSA-N,3.143164,2.149394,-2.213343,2.314176,-2.255821,7.182521,0.293136,35.495705,10.220235,1.674376,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.727124,CMCQWYYGEHUAAM-UHFFFAOYSA-N
CMGVRLPXTCGYEI-UHFFFAOYSA-N,3.560957,2.208343,-2.143250,2.270584,-2.221644,5.898911,0.268010,16.465364,10.118351,1.270270,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.464554,CMGVRLPXTCGYEI-UHFFFAOYSA-N
CMHQDWHZIUMRGC-UHFFFAOYSA-N,3.444091,2.472842,-2.163351,2.325829,-2.421564,7.118022,-0.137029,35.495701,10.109233,1.600798,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.246801,CMHQDWHZIUMRGC-UHFFFAOYSA-N
CMLKEDYTLQARSM-UHFFFAOYSA-N,3.232592,2.132376,-2.032054,2.300074,-2.040877,7.208708,1.096688,32.134693,10.432814,1.666281,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.588101,CMLKEDYTLQARSM-UHFFFAOYSA-N
CMMXTSNUEHRJEY-UHFFFAOYSA-N,3.285367,2.436150,-2.162489,2.318410,-2.326463,5.826612,-0.051069,16.634863,10.128347,1.777639,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.325304,CMMXTSNUEHRJEY-UHFFFAOYSA-N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
UNFCWBZGDOODJH-UHFFFAOYSA-N,3.250113,2.136462,-2.051518,2.313299,-2.086214,7.208439,0.581795,32.133545,10.149388,1.536101,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.578341,UNFCWBZGDOODJH-UHFFFAOYSA-N
UNGUJUIOUVNWBB-UHFFFAOYSA-N,3.212292,2.172590,-2.078829,2.231658,-2.179798,5.897337,0.207708,16.333868,10.374295,1.943525,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.579983,UNGUJUIOUVNWBB-UHFFFAOYSA-N
UNGYPXFMYYJLOR-UHFFFAOYSA-N,3.134793,2.193445,-2.086792,2.239148,-2.203306,6.058907,0.102176,16.153765,10.187603,1.779438,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.456132,UNGYPXFMYYJLOR-UHFFFAOYSA-N
UNIUKKWPGRGSRQ-UHFFFAOYSA-N,2.636158,2.129917,-2.014148,2.409279,-1.750494,9.109511,0.469374,79.919762,10.160447,2.618619,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.650118,UNIUKKWPGRGSRQ-UHFFFAOYSA-N


## What's Next?

Now you know how data sets are represented in QSPRpred. Before you start modelling, you should also check out the [data preparation tutorial](data_preparation.ipynb) to learn how to prepare your data sets for modelling. This tutorial covers additional preparation steps such as feature filtering, selection and standardization through the `QSPRTable.prepareDataset` method.