# Descriptor calculation
This tutorial provides an overview of the descriptors that can be calculated with QSPRpred.

First, we will import the necessary modules and load the dataset that we will use for this tutorial.

In [1]:
from qsprpred.data import MoleculeTable
import os

os.makedirs("../../tutorial_output/data", exist_ok=True)

dataset = MoleculeTable.fromTableFile(
    filename="../../tutorial_data/A2A_LIGANDS.tsv",
    store_dir="../../tutorial_output/data",
    name="DescriptorsTutorialDataset",
)

dataset.getDF()



QSPRID,SMILES,pchembl_value_Mean,Year,QSPRID
DescriptorsTutorialDataset_0000,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,8.68,2008.0,DescriptorsTutorialDataset_0000
DescriptorsTutorialDataset_0001,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,4.82,2010.0,DescriptorsTutorialDataset_0001
DescriptorsTutorialDataset_0002,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,5.65,2009.0,DescriptorsTutorialDataset_0002
DescriptorsTutorialDataset_0003,CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...,5.45,2009.0,DescriptorsTutorialDataset_0003
DescriptorsTutorialDataset_0004,CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...,5.20,2019.0,DescriptorsTutorialDataset_0004
...,...,...,...,...
DescriptorsTutorialDataset_4077,CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12,7.09,2018.0,DescriptorsTutorialDataset_4077
DescriptorsTutorialDataset_4078,Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1,8.22,2008.0,DescriptorsTutorialDataset_4078
DescriptorsTutorialDataset_4079,Nc1nc(Nc2ccc(F)cc2)nc(CSc2nnc(N)s2)n1,4.89,2010.0,DescriptorsTutorialDataset_4079
DescriptorsTutorialDataset_4080,CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1,6.51,2013.0,DescriptorsTutorialDataset_4080


## Descriptor sets

In QSPRpred, descriptors are organized into sets. Each `DescriptorSet` can contain one or more types of descriptors. For example, `RDKitDescs` contains the physicochemical properties calculated with RDKit, while `Mordred` contains descriptors calculated with Mordred.

Descriptors can be added to a dataset using the `addDescriptors` method. This method will calculate the descriptors and add them to the dataset.

In [2]:
from qsprpred.data.descriptors.sets import RDKitDescs

rdkit_descs = RDKitDescs()

dataset.addDescriptors([rdkit_descs])

dataset.descriptorSets

[<qsprpred.data.descriptors.sets.RDKitDescs at 0x7fdec67b6c20>]

With the `getDescriptorNames` method, we can retrieve the names of the calculated descriptors from the dataset, and with the `getDescriptors` method, we can retrieve the calculated descriptors themselves.

In [3]:
display(dataset.getDescriptorNames()[0:10])

display(dataset.getDescriptors().head())


['AvgIpc',
 'BCUT2D_CHGHI',
 'BCUT2D_CHGLO',
 'BCUT2D_LOGPHI',
 'BCUT2D_LOGPLOW',
 'BCUT2D_MRHI',
 'BCUT2D_MRLOW',
 'BCUT2D_MWHI',
 'BCUT2D_MWLOW',
 'BalabanJ']

QSPRID,AvgIpc,BCUT2D_CHGHI,BCUT2D_CHGLO,BCUT2D_LOGPHI,BCUT2D_LOGPLOW,BCUT2D_MRHI,BCUT2D_MRLOW,BCUT2D_MWHI,BCUT2D_MWLOW,BalabanJ,...,fr_sulfonamd,fr_sulfone,fr_term_acetylene,fr_tetrazole,fr_thiazole,fr_thiocyan,fr_thiophene,fr_unbrch_alkane,fr_urea,qed
DescriptorsTutorialDataset_0000,3.175462,2.146917,-2.111585,2.224211,-2.211801,5.89803,-0.115978,16.34251,10.325111,1.97945,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.719563
DescriptorsTutorialDataset_0001,2.962034,2.203517,-2.136048,2.354533,-2.114561,7.208763,-0.384429,32.133541,9.952695,1.63729,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.537523
DescriptorsTutorialDataset_0002,3.127957,2.1882,-2.067268,2.189363,-2.20295,6.053281,0.102176,16.153774,10.19078,1.748699,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.517728
DescriptorsTutorialDataset_0003,3.491801,2.749014,-2.231626,2.674016,-2.410868,6.301607,-0.140325,35.495693,9.981499,1.448422,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.173447
DescriptorsTutorialDataset_0004,3.236332,2.185263,-2.113838,2.178416,-2.404113,7.903051,0.095234,32.227749,10.184907,1.581267,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.334943


Descriptors can also be calculated through the `prepareDataset` method of the `QSPRDataset` class.

In [6]:
from qsprpred.data import QSPRDataset
from qsprpred.data.descriptors.fingerprints import MorganFP

qspr_dataset = QSPRDataset.fromMolTable(
    dataset,
    target_props=[{"name": "pchembl_value_Mean", "task": "REGRESSION"}],
    name="DescriptorsTutorialQSPRDataset",
)

qspr_dataset.prepareDataset(feature_calculators=[MorganFP(radius=2, nBits=128)])

print(qspr_dataset.descriptorSets)
qspr_dataset.getDescriptors().head()

[<qsprpred.data.descriptors.sets.RDKitDescs object at 0x7fdec67b6c20>, <qsprpred.data.descriptors.fingerprints.MorganFP object at 0x7fde02fca290>]


QSPRID,AvgIpc,BCUT2D_CHGHI,BCUT2D_CHGLO,BCUT2D_LOGPHI,BCUT2D_LOGPLOW,BCUT2D_MRHI,BCUT2D_MRLOW,BCUT2D_MWHI,BCUT2D_MWLOW,BalabanJ,...,MorganFP_118,MorganFP_119,MorganFP_120,MorganFP_121,MorganFP_122,MorganFP_123,MorganFP_124,MorganFP_125,MorganFP_126,MorganFP_127
DescriptorsTutorialDataset_0000,3.175462,2.146917,-2.111585,2.224211,-2.211801,5.89803,-0.115978,16.34251,10.325111,1.97945,...,False,False,False,True,True,True,True,True,False,False
DescriptorsTutorialDataset_0001,2.962034,2.203517,-2.136048,2.354533,-2.114561,7.208763,-0.384429,32.133541,9.952695,1.63729,...,False,False,False,False,True,False,True,True,True,True
DescriptorsTutorialDataset_0002,3.127957,2.1882,-2.067268,2.189363,-2.20295,6.053281,0.102176,16.153774,10.19078,1.748699,...,False,False,True,False,True,False,False,True,False,False
DescriptorsTutorialDataset_0003,3.491801,2.749014,-2.231626,2.674016,-2.410868,6.301607,-0.140325,35.495693,9.981499,1.448422,...,True,False,False,False,True,True,True,True,False,False
DescriptorsTutorialDataset_0004,3.236332,2.185263,-2.113838,2.178416,-2.404113,7.903051,0.095234,32.227749,10.184907,1.581267,...,False,False,True,False,True,False,True,True,False,False


Note: any feature standardization or feature filtering that is applied is not reflected in the descriptor dataframe; the `getDescriptors` method always returns the original descriptors. To retrieve the standardized or filtered descriptors, use the `getFeatures` method.
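
The distinction between raw descriptors and processed features can be pictured with a small plain-Python sketch. It does not use QSPRpred: the function `filter_and_standardize`, the descriptor names, and the values are made up for illustration, and the low-variance filter and z-score scaling are assumed stand-ins for whatever filter and standardizer a dataset is configured with.

```python
from statistics import mean, pstdev

# Toy descriptor table (made-up values): rows are molecules, columns are descriptors.
names = ["desc_a", "desc_b", "desc_c"]
raw = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 1.0],
    [3.0, 5.0, 0.0],
    [4.0, 5.0, 1.0],
]


def filter_and_standardize(rows, names, min_std=1e-8):
    """Drop near-constant columns, then scale the rest to zero mean and unit variance."""
    columns = list(zip(*rows))
    kept = [(n, c) for n, c in zip(names, columns) if pstdev(c) > min_std]
    scaled = [[(v - mean(c)) / pstdev(c) for v in c] for _, c in kept]
    kept_names = [n for n, _ in kept]
    # Transpose back so that each row is a molecule again.
    return kept_names, [list(row) for row in zip(*scaled)]


feature_names, features = filter_and_standardize(raw, names)

print(names, raw[0])                # raw descriptors are left untouched
print(feature_names, features[0])   # processed features: fewer columns, rescaled values
```

The raw table keeps all three columns with their original values, while the processed table drops the constant `desc_b` column and rescales the remaining ones. This mirrors the difference between the descriptor dataframe and the feature matrix described in the note.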

Descriptor sets can also be used to calculate descriptors directly for a list of molecules with their `getDescriptors` method.

In [9]:
from qsprpred.data.descriptors.fingerprints import MorganFP
from rdkit import Chem

smiles = ["CC(=O)NC1=CC=C(C=C1)O", "CN1CCC23C4C1CC5=C2C(=C(C=C5)O)OC3C(C=C4)O"]
mols = [Chem.MolFromSmiles(smi) for smi in smiles]  # avoid shadowing the `smiles` list

morgan_fp = MorganFP(radius=2, nBits=128)

morgan_fp.getDescriptors(mols, props=None)

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1.,
        1., 0., 0., 1., 0., 1., 1., 0., 1., 1., 0., 0., 0., 1., 1., 0.,
        0., 1., 0., 0., 1., 0., 0., 1., 1., 0., 1., 0., 1., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 1., 1., 0., 0., 0.,
        0., 1., 1., 1., 1., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 0.,
        0., 1., 0., 1., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0.,

## Examples

Here, we will go over a few examples of the available descriptor sets. The full list of available descriptor sets can be found in the [documentation](https://cddleiden.github.io/QSPRpred/docs/features.html). Note that some descriptor sets may require additional dependencies to be installed before they can be used; see the [installation instructions](https://cddleiden.github.io/QSPRpred/docs-dev/install.html) for more information.

### Fingerprints

Fingerprints are a type of molecular descriptor that encodes the presence or absence of substructures in a molecule. They are often used in cheminformatics for tasks such as similarity searching and clustering. In QSPRpred, fingerprints are provided as descriptor sets in the `qsprpred.data.descriptors.fingerprints` module, and each bit of the fingerprint is a separate descriptor. The previous examples calculated Morgan fingerprints with `MorganFP`. Another fingerprint type available in QSPRpred is the MACCS keys, calculated with `MaccsFP`.

In [10]:
from qsprpred.data.descriptors.fingerprints import MaccsFP

dataset.addDescriptors([MaccsFP()])

dataset.getDescriptors().head()



QSPRID,AvgIpc,BCUT2D_CHGHI,BCUT2D_CHGLO,BCUT2D_LOGPHI,BCUT2D_LOGPLOW,BCUT2D_MRHI,BCUT2D_MRLOW,BCUT2D_MWHI,BCUT2D_MWLOW,BalabanJ,...,MACCSFP_157,MACCSFP_158,MACCSFP_159,MACCSFP_160,MACCSFP_161,MACCSFP_162,MACCSFP_163,MACCSFP_164,MACCSFP_165,MACCSFP_166
DescriptorsTutorialDataset_0000,3.175462,2.146917,-2.111585,2.224211,-2.211801,5.89803,-0.115978,16.34251,10.325111,1.97945,...,False,True,True,True,True,True,True,True,True,False
DescriptorsTutorialDataset_0001,2.962034,2.203517,-2.136048,2.354533,-2.114561,7.208763,-0.384429,32.133541,9.952695,1.63729,...,False,True,True,False,True,True,True,True,True,False
DescriptorsTutorialDataset_0002,3.127957,2.1882,-2.067268,2.189363,-2.20295,6.053281,0.102176,16.153774,10.19078,1.748699,...,False,True,True,False,True,True,True,True,True,False
DescriptorsTutorialDataset_0003,3.491801,2.749014,-2.231626,2.674016,-2.410868,6.301607,-0.140325,35.495693,9.981499,1.448422,...,True,True,True,True,True,True,True,True,True,False
DescriptorsTutorialDataset_0004,3.236332,2.185263,-2.113838,2.178416,-2.404113,7.903051,0.095234,32.227749,10.184907,1.581267,...,True,True,True,True,True,True,True,True,True,False
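
Because each fingerprint bit is a separate descriptor, two molecules can be compared bit by bit, which is the basis of the similarity searching mentioned above. A common measure for this is the Tanimoto coefficient: the number of bits set in both fingerprints divided by the number set in either. The sketch below is plain Python and does not use QSPRpred or RDKit. The function name `tanimoto` and the two short bit vectors are made up for illustration, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two equal-length bit vectors."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # Bits set in both fingerprints (intersection).
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    # Bits set in at least one fingerprint (union).
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Convention chosen for this sketch: two all-zero fingerprints score 0.0.
    return both / either if either else 0.0


# Two made-up 8-bit fingerprints; real fingerprints are much longer.
fp_1 = [True, True, False, True, False, False, True, False]
fp_2 = [True, False, False, True, False, True, True, False]

similarity = tanimoto(fp_1, fp_2)
print(similarity)
```

The same function would work on a row of Boolean fingerprint columns taken from the descriptor dataframe, converted to a list.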


### Pre-calculated descriptors

In addition to calculating descriptors with QSPRpred, it is also possible to add pre-calculated descriptors to a dataset. This can be done with the `DataFrameDescriptorSet` class. This class takes a pandas DataFrame as input, where the rows correspond to the molecules and the columns correspond to the descriptors. The `DataFrameDescriptorSet` can then be added to a dataset with the `addDescriptors` method.

Note that with pre-calculated descriptors, predictions with trained models are limited to molecules that are present in the pre-calculated descriptor dataframe.

In [11]:
from qsprpred.data.descriptors.sets import DataFrameDescriptorSet
import numpy as np
import pandas as pd

# Create a dataframe with 10 columns of random values with the same index as the dataset
index = dataset.getDF().index
random_descriptors = pd.DataFrame(np.random.rand(len(index), 10),
                                  index=index,
                                  columns=[f"random_{i}" for i in range(10)])
display(random_descriptors.head())

# Create a DataFrameDescriptorSet from the random descriptors
random_descriptor_set = DataFrameDescriptorSet(random_descriptors)
dataset.addDescriptors([random_descriptor_set])
dataset.getDescriptors().head()

QSPRID,random_0,random_1,random_2,random_3,random_4,random_5,random_6,random_7,random_8,random_9
DescriptorsTutorialDataset_0000,0.112048,0.800043,0.936078,0.062755,0.640164,0.496888,0.711522,0.024552,0.342392,0.030565
DescriptorsTutorialDataset_0001,0.655816,0.116491,0.986175,0.075547,0.499157,0.663333,0.496793,0.773352,0.893256,0.202822
DescriptorsTutorialDataset_0002,0.158306,0.903855,0.52473,0.189414,0.310671,0.313358,0.317254,0.599103,0.524731,0.363653
DescriptorsTutorialDataset_0003,0.257356,0.939345,0.165602,0.025083,0.607322,0.767123,0.957342,0.398412,0.933713,0.76019
DescriptorsTutorialDataset_0004,0.823825,0.061939,0.439873,0.914717,0.175722,0.769527,0.570256,0.837922,0.642899,0.499636


QSPRID,AvgIpc,BCUT2D_CHGHI,BCUT2D_CHGLO,BCUT2D_LOGPHI,BCUT2D_LOGPLOW,BCUT2D_MRHI,BCUT2D_MRLOW,BCUT2D_MWHI,BCUT2D_MWLOW,BalabanJ,...,random_0,random_1,random_2,random_3,random_4,random_5,random_6,random_7,random_8,random_9
DescriptorsTutorialDataset_0000,3.175462,2.146917,-2.111585,2.224211,-2.211801,5.89803,-0.115978,16.34251,10.325111,1.97945,...,0.112048,0.800043,0.936078,0.062755,0.640164,0.496888,0.711522,0.024552,0.342392,0.030565
DescriptorsTutorialDataset_0001,2.962034,2.203517,-2.136048,2.354533,-2.114561,7.208763,-0.384429,32.133541,9.952695,1.63729,...,0.655816,0.116491,0.986175,0.075547,0.499157,0.663333,0.496793,0.773352,0.893256,0.202822
DescriptorsTutorialDataset_0002,3.127957,2.1882,-2.067268,2.189363,-2.20295,6.053281,0.102176,16.153774,10.19078,1.748699,...,0.158306,0.903855,0.52473,0.189414,0.310671,0.313358,0.317254,0.599103,0.524731,0.363653
DescriptorsTutorialDataset_0003,3.491801,2.749014,-2.231626,2.674016,-2.410868,6.301607,-0.140325,35.495693,9.981499,1.448422,...,0.257356,0.939345,0.165602,0.025083,0.607322,0.767123,0.957342,0.398412,0.933713,0.76019
DescriptorsTutorialDataset_0004,3.236332,2.185263,-2.113838,2.178416,-2.404113,7.903051,0.095234,32.227749,10.184907,1.581267,...,0.823825,0.061939,0.439873,0.914717,0.175722,0.769527,0.570256,0.837922,0.642899,0.499636
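
The limitation on pre-calculated descriptors, that values exist only for molecules already in the table, can be pictured as a simple lookup by identifier. The sketch below is plain Python and is not how QSPRpred implements it. The identifiers, the values, and the helper `lookup_descriptors` are hypothetical.

```python
# Hypothetical pre-calculated descriptor table, keyed by molecule identifier.
precalculated = {
    "mol_0000": [0.11, 0.80],
    "mol_0001": [0.66, 0.12],
}


def lookup_descriptors(mol_ids, table):
    """Split identifiers into those with stored descriptors and those without."""
    found = {m: table[m] for m in mol_ids if m in table}
    missing = [m for m in mol_ids if m not in table]
    return found, missing


# "mol_9999" was never pre-calculated, so no descriptor values can be supplied for it.
found, missing = lookup_descriptors(["mol_0000", "mol_9999"], precalculated)
print(found)
print(missing)
```

A model trained on such descriptors has no input for the identifiers in `missing`, which is why its predictions are restricted to molecules in the pre-calculated table.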


### Using a trained model to calculate descriptors

In some cases, it may be useful to use a trained model to calculate properties of a molecule that can then be used as descriptors. This can be done with the `PredictorDesc` class. This class takes a trained model as input and uses it to make predictions for a set of molecules. The predictions are then added to the dataset as descriptors with the `addDescriptors` method.

In [12]:
# First we need to create a model to use as descriptor set
from qsprpred.models.scikit_learn import SklearnModel
from sklearn.ensemble import RandomForestRegressor

dataset_for_predictor = QSPRDataset.fromTableFile(
    filename="../../tutorial_data/A2A_LIGANDS.tsv",
    store_dir="../../tutorial_output/data",
    name="DescriptorsTutorialPredictorDataset",
    target_props=[{"name": "pchembl_value_Mean", "task": "REGRESSION"}]
)

dataset_for_predictor.prepareDataset(feature_calculators=[MorganFP(radius=2, nBits=128)])

model = SklearnModel(base_dir="../../tutorial_output/models",
                     name="DescriptorsTutorialModel",
                     alg=RandomForestRegressor)

_ = model.fitDataset(dataset_for_predictor)

In [13]:
# Now we can use the model as a descriptor set
from qsprpred.data.descriptors.sets import PredictorDesc

predictor_desc = PredictorDesc(model)

dataset.addDescriptors([predictor_desc])

dataset.getDescriptors().head()

QSPRID,AvgIpc,BCUT2D_CHGHI,BCUT2D_CHGLO,BCUT2D_LOGPHI,BCUT2D_LOGPLOW,BCUT2D_MRHI,BCUT2D_MRLOW,BCUT2D_MWHI,BCUT2D_MWLOW,BalabanJ,...,random_1,random_2,random_3,random_4,random_5,random_6,random_7,random_8,random_9,DescriptorsTutorialModel
DescriptorsTutorialDataset_0000,3.175462,2.146917,-2.111585,2.224211,-2.211801,5.89803,-0.115978,16.34251,10.325111,1.97945,...,0.800043,0.936078,0.062755,0.640164,0.496888,0.711522,0.024552,0.342392,0.030565,8.7179
DescriptorsTutorialDataset_0001,2.962034,2.203517,-2.136048,2.354533,-2.114561,7.208763,-0.384429,32.133541,9.952695,1.63729,...,0.116491,0.986175,0.075547,0.499157,0.663333,0.496793,0.773352,0.893256,0.202822,5.564867
DescriptorsTutorialDataset_0002,3.127957,2.1882,-2.067268,2.189363,-2.20295,6.053281,0.102176,16.153774,10.19078,1.748699,...,0.903855,0.52473,0.189414,0.310671,0.313358,0.317254,0.599103,0.524731,0.363653,6.1108
DescriptorsTutorialDataset_0003,3.491801,2.749014,-2.231626,2.674016,-2.410868,6.301607,-0.140325,35.495693,9.981499,1.448422,...,0.939345,0.165602,0.025083,0.607322,0.767123,0.957342,0.398412,0.933713,0.76019,5.47677
DescriptorsTutorialDataset_0004,3.236332,2.185263,-2.113838,2.178416,-2.404113,7.903051,0.095234,32.227749,10.184907,1.581267,...,0.061939,0.439873,0.914717,0.175722,0.769527,0.570256,0.837922,0.642899,0.499636,5.65615


### Custom descriptors

To implement your own descriptors, see the [custom descriptors](../../advanced/data/custom_descriptors.ipynb) tutorial.