# Data Preparation - Advanced Features

In this notebook, we will explore the advanced features of the data preparation modules. We will start with more advanced data integration options with the `Papyrus` class and then move on to calculating protein descriptors and working with molecule scaffolds.

## Interfacing with Papyrus

The `Papyrus` class is a wrapper around the [Papyrus data set](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00672-x) and its [associated scripts](https://github.com/OlivierBeq/Papyrus-scripts). The class fetches the data set and supports efficient filtering and querying for some common tasks. In this tutorial, we show how to fetch bioactivity data for a particular set of protein targets, as well as the protein sequences that will later be used to calculate protein descriptors for proteochemometric (PCM) models.

### Fetching Bioactivity Data

We can use the following code to fetch bioactivity data for multiple adenosine receptor subtypes at once:

In [1]:
from qsprpred.data.sources.papyrus import Papyrus

acc_keys = ["P29274", "P29275", "P30542", "P0DMS8"]
dataset_name = "AR_LIGANDS"  # name of the file to be generated
quality = "high"  # choose minimum quality from {"high", "medium", "low"}
papyrus_version = '05.6'  # Papyrus database version
data_dir = "data"

papyrus = Papyrus(
    data_dir=data_dir,
    stereo=False,
    version=papyrus_version
)

mt = papyrus.getData(
    acc_keys,
    quality,
    name=dataset_name,
    use_existing=True,
    activity_types=["Ki", "IC50", "Kd"]
)
mt

You are downloading the high-quality Papyrus++ dataset.
Should you want to access the entire, though of lower quality, Papyrus dataset,
look into additional switches of this command.
Number of files to be downloaded: 3
Total size: 33.0MB


Downloading version 05.6:   0%|          | 0.00/33.0M [00:00<?, ?B/s]

Using existing data from data/AR_LIGANDS.tsv...


<qsprpred.data.data.MoleculeTable at 0x7f0bc574c730>
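
To make the three filters in the call above more concrete (accession keys, minimum quality, activity types), here is a minimal pure-Python sketch of the same idea. It is not the Papyrus implementation: the record layout, the `filter_records` helper and the compound identifiers are hypothetical, and the reading of "minimum quality" as a ranking `low < medium < high` is an assumption based on the comment in the cell.

```python
# Assumed ranking of quality labels; "minimum quality" keeps this level and above.
QUALITY_RANK = {"low": 0, "medium": 1, "high": 2}


def filter_records(records, acc_keys, min_quality, activity_types):
    """Keep records matching the requested targets, quality and activity types."""
    min_rank = QUALITY_RANK[min_quality.lower()]
    wanted_targets = set(acc_keys)
    wanted_types = set(activity_types)
    return [
        r
        for r in records
        if r["accession"] in wanted_targets
        and QUALITY_RANK[r["quality"].lower()] >= min_rank
        and r["type"] in wanted_types
    ]


# Toy records: identifiers (and the accession "Q99999") are made up.
records = [
    {"id": "cpd_A", "accession": "P29274", "quality": "High", "type": "Ki"},
    {"id": "cpd_B", "accession": "P29274", "quality": "Low", "type": "Ki"},
    {"id": "cpd_C", "accession": "P30542", "quality": "High", "type": "EC50"},
    {"id": "cpd_D", "accession": "Q99999", "quality": "High", "type": "IC50"},
]

kept = filter_records(records, ["P29274", "P30542"], "high", ["Ki", "IC50", "Kd"])
kept_low = filter_records(records, ["P29274", "P30542"], "low", ["Ki", "IC50", "Kd"])
print([r["id"] for r in kept])
print([r["id"] for r in kept_low])
```

Under this reading, lowering the minimum quality only ever adds records; it never removes any.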

In [2]:
mt.getDF()

QSPRID,Activity_ID,Quality,source,CID,SMILES,connectivity,InChIKey,InChI,InChI_AuxInfo,target_id,...,Activity_class,relation,pchembl_value,pchembl_value_Mean,pchembl_value_StdDev,pchembl_value_SEM,pchembl_value_N,pchembl_value_Median,pchembl_value_MAD,QSPRID
AR_LIGANDS_0,AACWUFIIMOHGSO_on_P29274_WT,High,ChEMBL31,ChEMBL31.compound.91968,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,AACWUFIIMOHGSO,AACWUFIIMOHGSO-UHFFFAOYSA-N,InChI=1S/C19H24N6O2/c1-12-10-13(2)25(23-12)17-...,"""AuxInfo=1/1/N:1,26,22,14,15,20,19,11,12,27,6,...",P29274_WT,...,,=,8.68,8.680,0.000000,0.000000,1.0,8.680,0.000,AR_LIGANDS_0
AR_LIGANDS_1,AACWUFIIMOHGSO_on_P30542_WT,High,ChEMBL31,ChEMBL31.compound.91968,Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...,AACWUFIIMOHGSO,AACWUFIIMOHGSO-UHFFFAOYSA-N,InChI=1S/C19H24N6O2/c1-12-10-13(2)25(23-12)17-...,"""AuxInfo=1/1/N:1,26,22,14,15,20,19,11,12,27,6,...",P30542_WT,...,,=,6.68,6.680,0.000000,0.000000,1.0,6.680,0.000,AR_LIGANDS_1
AR_LIGANDS_2,AAEYTMMNWWKSKZ_on_P29274_WT,High,ChEMBL31,ChEMBL31.compound.131451,Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...,AAEYTMMNWWKSKZ,AAEYTMMNWWKSKZ-UHFFFAOYSA-N,InChI=1S/C18H16N4O3S/c19-15-13-9-10-3-1-2-4-14...,"""AuxInfo=1/1/N:22,23,21,24,8,15,9,14,19,20,7,1...",P29274_WT,...,,=,4.82,4.820,0.000000,0.000000,1.0,4.820,0.000,AR_LIGANDS_2
AR_LIGANDS_3,AAGFKZWKWAMJNP_on_P0DMS8_WT,High,ChEMBL31,ChEMBL31.compound.100375,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,AAGFKZWKWAMJNP,AAGFKZWKWAMJNP-UHFFFAOYSA-N,InChI=1S/C21H14N6O2/c28-20(14-8-3-1-4-9-14)24-...,"""AuxInfo=1/1/N:27,19,26,28,18,20,9,25,29,17,21...",P0DMS8_WT,...,,=,7.15,7.150,0.000000,0.000000,1.0,7.150,0.000,AR_LIGANDS_3
AR_LIGANDS_4,AAGFKZWKWAMJNP_on_P29274_WT,High,ChEMBL31,ChEMBL31.compound.100375,O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1,AAGFKZWKWAMJNP,AAGFKZWKWAMJNP-UHFFFAOYSA-N,InChI=1S/C21H14N6O2/c28-20(14-8-3-1-4-9-14)24-...,"""AuxInfo=1/1/N:27,19,26,28,18,20,9,25,29,17,21...",P29274_WT,...,,=,5.65,5.650,0.000000,0.000000,1.0,5.650,0.000,AR_LIGANDS_4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AR_LIGANDS_12435,ZXPDGTGMZKIESV_on_P29274_WT,High,ChEMBL31,ChEMBL31.compound.49348;ChEMBL31.compound.413675,CNC(=O)C12CC1C(n1cnc3c1nc(Cl)nc3NC1CCCC1)C(O)C2O,ZXPDGTGMZKIESV,ZXPDGTGMZKIESV-UHFFFAOYSA-N,InChI=1S/C18H23ClN6O3/c1-20-16(28)18-6-9(18)11...,"""AuxInfo=1/1/N:1,22,23,21,24,6,10,20,7,12,8,25...",P29274_WT,...,,=,5.49;5.49,5.490,0.000000,0.000000,2.0,5.490,0.000,AR_LIGANDS_12435
AR_LIGANDS_12436,ZXPDGTGMZKIESV_on_P30542_WT,High,ChEMBL31,ChEMBL31.compound.49348;ChEMBL31.compound.413675,CNC(=O)C12CC1C(n1cnc3c1nc(Cl)nc3NC1CCCC1)C(O)C2O,ZXPDGTGMZKIESV,ZXPDGTGMZKIESV-UHFFFAOYSA-N,InChI=1S/C18H23ClN6O3/c1-20-16(28)18-6-9(18)11...,"""AuxInfo=1/1/N:1,22,23,21,24,6,10,20,7,12,8,25...",P30542_WT,...,,=,7.74;7.74,7.740,0.000000,0.000000,2.0,7.740,0.000,AR_LIGANDS_12436
AR_LIGANDS_12437,ZYEXHNHGCDESIU_on_P0DMS8_WT,High,ChEMBL31,ChEMBL31.compound.1825,Nc1c2nc(C#Cc3ccccc3)n(C3OC(CO)C(O)C3O)c2ncn1,ZYEXHNHGCDESIU,ZYEXHNHGCDESIU-UHFFFAOYSA-N,InChI=1S/C18H17N5O4/c19-16-13-17(21-9-20-16)23...,"""AuxInfo=1/1/N:11,10,12,9,13,7,6,18,26,8,17,5,...",P0DMS8_WT,...,,=,6.07;6.1,6.085,0.021213,0.015000,2.0,6.085,0.015,AR_LIGANDS_12437
AR_LIGANDS_12438,ZYQMTELMKXLVHN_on_P0DMS8_WT,High,ChEMBL31,ChEMBL31.compound.18654,CCCn1cc2c(n1)nc(NC(=O)Nc1ccccc1OC)n1nc(-c3ccco...,ZYQMTELMKXLVHN,ZYQMTELMKXLVHN-UHFFFAOYSA-N,InChI=1S/C21H20N8O3/c1-3-10-28-12-13-17(26-28)...,"""AuxInfo=1/1/N:1,22,2,17,18,28,16,19,27,3,29,5...",P0DMS8_WT,...,,=,6.47;9.47;9.47,8.470,1.732051,1.000000,3.0,9.470,0.000,AR_LIGANDS_12438
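
The aggregate columns in the table above (`pchembl_value_Mean`, `_StdDev`, `_SEM`, `_N`, `_Median`, `_MAD`) appear to summarise the semicolon-separated measurements in `pchembl_value`. The sketch below reproduces the displayed numbers for the last two rows using only the standard library. The formulas (sample standard deviation, median absolute deviation) are inferred from those rows and are not taken from the Papyrus source; `summarize` is a hypothetical helper.

```python
import math
import statistics


def summarize(pchembl_field):
    """Summarise a ';'-separated pchembl_value string."""
    values = [float(v) for v in pchembl_field.split(";")]
    n = len(values)
    median = statistics.median(values)
    # Sample standard deviation; defined as 0 for a single measurement.
    stdev = statistics.stdev(values) if n > 1 else 0.0
    return {
        "Mean": statistics.mean(values),
        "StdDev": stdev,
        "SEM": stdev / math.sqrt(n),
        "N": n,
        "Median": median,
        # Median absolute deviation from the median.
        "MAD": statistics.median([abs(v - median) for v in values]),
    }


for field in ("6.07;6.1", "6.47;9.47;9.47"):
    stats = summarize(field)
    print(field, {k: round(v, 6) for k, v in stats.items()})
```

Note how the third measurement set has a mean of 8.47 but a median of 9.47, which is one reason to prefer the median as the modelling target when replicate measurements disagree.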


By default, the method returns a `MoleculeTable`. To turn it into a `QSPRDataset` for modeling, use the `fromMolTable` helper method:

In [3]:
from qsprpred.models.tasks import TargetTasks
from qsprpred.data.data import QSPRDataset

target_props=[{"name": "pchembl_value_Median", "task": TargetTasks.SINGLECLASS, "th": [6.5]}]
ds = QSPRDataset.fromMolTable(mt, target_props=target_props)
ds.targetProperties

[TargetProperty(name=pchembl_value_Median_class, task=SINGLECLASS, th=[6.5])]
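
Conceptually, a `SINGLECLASS` task with `th=[6.5]` turns the continuous `pchembl_value_Median` into a binary label. The following sketch illustrates that idea on the first five median values from the table above. It assumes that values strictly above the threshold are labelled active; how QSPRpred treats values exactly at the threshold is not shown in this notebook, so check the library if that matters for your data. `binarize` is a hypothetical helper.

```python
THRESHOLD = 6.5


def binarize(values, threshold=THRESHOLD):
    """Label each value 1 (active) if above the threshold, else 0 (inactive)."""
    # Assumption: strict comparison; the boundary convention may differ in QSPRpred.
    return [int(v > threshold) for v in values]


# Median pChEMBL values of AR_LIGANDS_0 .. AR_LIGANDS_4 from the table above.
medians = [8.68, 6.68, 4.82, 7.15, 5.65]
labels = binarize(medians)
print(labels)
```

None of these five values lies on the threshold, so the labels are the same under either boundary convention.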

### Fetching Protein Data

We can also easily fetch the sequences of our proteins:

In [4]:
ds_seq = papyrus.getProteinData(acc_keys, name=f"{ds.name}_seqs", use_existing=True)
ds_seq

You are downloading the high-quality Papyrus++ dataset.
Should you want to access the entire, though of lower quality, Papyrus dataset,
look into additional switches of this command.
Number of files to be downloaded: 3
Total size: 33.0MB


Downloading version 05.6:   0%|          | 0.00/33.0M [00:00<?, ?B/s]

Unnamed: 0,target_id,HGNC_symbol,UniProtID,Status,Organism,Classification,Length,Sequence,TID,accession
0,P29275_WT,ADORA2B,AA2BR_HUMAN,reviewed,Homo sapiens (Human),Membrane receptor->Family A G protein-coupled ...,332,MLLETQDALYVALELVIAALSVAGNVLVCAAVGTANTLQTPTNYFL...,ChEMBL:CHEMBL255;ChEMBL:CHEMBL255;ChEMBL:CHEMB...,P29275
1,P30542_WT,ADORA1,AA1R_HUMAN,reviewed,Homo sapiens (Human),Membrane receptor->Family A G protein-coupled ...,326,MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFC...,ChEMBL:CHEMBL226;ChEMBL:CHEMBL226;ChEMBL:CHEMB...,P30542
2,P29274_WT,ADORA2A,AA2AR_HUMAN,reviewed,Homo sapiens (Human),Membrane receptor->Family A G protein-coupled ...,412,MPIMGSSVYITVELAIAVLAILGNVLVCWAVWLNSNLQNVTNYFVV...,ChEMBL:CHEMBL251;ChEMBL:CHEMBL251;ChEMBL:CHEMB...,P29274
3,P0DMS8_WT,ADORA3,AA3R_HUMAN,reviewed,Homo sapiens (Human),Membrane receptor->Family A G protein-coupled ...,318,MPNNSTALSLANVTYITMEIFIGLCAIVGNVLVICVVKLNPSLQTT...,ChEMBL:CHEMBL256;ChEMBL:CHEMBL256;ChEMBL:CHEMB...,P0DMS8


The keys used to fetch the data are saved in the `accession` column of the resulting data frame:

In [5]:
ds_seq.accession

0    P29275
1    P30542
2    P29274
3    P0DMS8
Name: accession, dtype: object

The same column is present in our data set, so sequences can now be matched to bioactivity data:

In [6]:
ds.getSubset("accession")

QSPRID,accession
AR_LIGANDS_0,P29274
AR_LIGANDS_1,P30542
AR_LIGANDS_2,P29274
AR_LIGANDS_3,P0DMS8
AR_LIGANDS_4,P29274
...,...
AR_LIGANDS_12435,P29274
AR_LIGANDS_12436,P30542
AR_LIGANDS_12437,P0DMS8
AR_LIGANDS_12438,P0DMS8
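
The matching itself is a plain key lookup on the accession. As a minimal sketch, assuming nothing beyond the values displayed in the two tables above, the snippet below attaches protein metadata to activity rows. It also shows that the `target_id` values in the bioactivity table (for example `P29274_WT`) reduce to the accession by dropping the suffix; `accession_from_target_id` is a hypothetical helper, not part of QSPRpred.

```python
# Protein metadata keyed by accession (symbols and lengths as displayed above).
proteins = {
    "P29275": {"HGNC_symbol": "ADORA2B", "Length": 332},
    "P30542": {"HGNC_symbol": "ADORA1", "Length": 326},
    "P29274": {"HGNC_symbol": "ADORA2A", "Length": 412},
    "P0DMS8": {"HGNC_symbol": "ADORA3", "Length": 318},
}

# (QSPRID, target_id) pairs for the first four activity rows.
activities = [
    ("AR_LIGANDS_0", "P29274_WT"),
    ("AR_LIGANDS_1", "P30542_WT"),
    ("AR_LIGANDS_2", "P29274_WT"),
    ("AR_LIGANDS_3", "P0DMS8_WT"),
]


def accession_from_target_id(target_id):
    """Strip the variant suffix, e.g. 'P29274_WT' -> 'P29274'."""
    return target_id.split("_", 1)[0]


joined = {
    qsprid: proteins[accession_from_target_id(target_id)]["HGNC_symbol"]
    for qsprid, target_id in activities
}
print(joined)
```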


## Calculating Protein Descriptors

In this section, we will show how to connect the information about sequences with our data set and calculate protein descriptors using multiple sequence alignment and the `PCMDataSet` class from the `qsprpred.extra` package. First, let us convert the original data set saved in the `ds` variable to a `PCMDataSet`:

In [7]:
from qsprpred.extra.data.data import PCMDataSet

def sequence_provider(acc_keys):
    """
    A function that provides a mapping from accession keys to protein sequences.

    Args:
        acc_keys (list): Accession keys of the proteins to get sequences for.

    Returns:
        (dict) : Mapping of accession keys to protein sequences.
        (dict) : Additional information to pass to the MSA provider (can be empty).
    """
    seq_map = dict()  # named `seq_map` to avoid shadowing the built-in `map`
    info = dict()
    # note: all sequences in `ds_seq` are returned, regardless of `acc_keys`
    for _, row in ds_seq.iterrows():
        seq_map[row['accession']] = row['Sequence']

        # can be omitted
        info[row['accession']] = {
            'Organism': row['Organism'],
            'UniProtID': row['UniProtID'],
        }

    return seq_map, info

ds = PCMDataSet.fromMolTable(
    ds,
    name=ds.name,
    protein_col="accession",
    protein_seq_provider=sequence_provider,
    store_dir="./data",
    target_props=ds.targetProperties
)
ds

<qsprpred.extra.data.data.PCMDataSet at 0x7f0b15fdafe0>
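
The provider contract used above is simply "accession keys in, two dictionaries out". As a sketch of a stricter variant, the factory below builds a provider from a plain dictionary, returns only the requested keys, and fails loudly when a sequence is missing. `make_sequence_provider` is a hypothetical helper, raising on missing keys is a design choice of this sketch rather than a QSPRpred requirement, and the toy sequences are only the first six residues displayed in the protein table, not full sequences.

```python
def make_sequence_provider(sequences, extra_info=None):
    """Build a provider: acc_keys -> (sequence mapping, extra info mapping)."""
    extra_info = extra_info or {}

    def provider(acc_keys):
        missing = [key for key in acc_keys if key not in sequences]
        if missing:
            raise KeyError(f"No sequence available for: {missing}")
        seq_map = {key: sequences[key] for key in acc_keys}
        info = {key: extra_info.get(key, {}) for key in acc_keys}
        return seq_map, info

    return provider


# Truncated toy sequences (first six residues only).
toy_sequences = {"P29274": "MPIMGS", "P0DMS8": "MPNNST"}
toy_info = {"P29274": {"Organism": "Homo sapiens (Human)"}}

provider = make_sequence_provider(toy_sequences, toy_info)
seq_map, info = provider(["P29274"])
print(seq_map, info)
```

Failing early on a missing accession makes a mismatch between the bioactivity table and the sequence table visible before any alignment is attempted.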

`PCMDataSet` knows how to connect accession keys to sequences thanks to the `protein_col` and `protein_seq_provider` arguments. When calculating protein descriptors, it automatically creates a multiple sequence alignment (MSA) for us. The calculation itself is handled by the `addProteinDescriptors` method and a `ProteinDescriptorCalculator`:

In [8]:
from qsprpred.extra.data.utils.descriptorcalculator import ProteinDescriptorCalculator
from qsprpred.extra.data.utils.descriptorsets import ProDec
from qsprpred.extra.data.utils.descriptor_utils.msa_calculator import ClustalMSA

calc = ProteinDescriptorCalculator(
    desc_sets=[ProDec(sets=["Zscale Hellberg"])],
    msa_provider=ClustalMSA(out_dir=ds.storeDir)
)
ds.addProteinDescriptors(calc)

  0%|          | 0/4 [00:00<?, ?it/s]

We can check the descriptor matrix:

In [9]:
ds.getDescriptors()

QSPRID,Descriptor_PCM_ProDec_Zscale_1,Descriptor_PCM_ProDec_Zscale_2,Descriptor_PCM_ProDec_Zscale_3,Descriptor_PCM_ProDec_Zscale_4,Descriptor_PCM_ProDec_Zscale_5,Descriptor_PCM_ProDec_Zscale_6,Descriptor_PCM_ProDec_Zscale_7,Descriptor_PCM_ProDec_Zscale_8,Descriptor_PCM_ProDec_Zscale_9,Descriptor_PCM_ProDec_Zscale_10,...,Descriptor_PCM_ProDec_Zscale_1269,Descriptor_PCM_ProDec_Zscale_1270,Descriptor_PCM_ProDec_Zscale_1271,Descriptor_PCM_ProDec_Zscale_1272,Descriptor_PCM_ProDec_Zscale_1273,Descriptor_PCM_ProDec_Zscale_1274,Descriptor_PCM_ProDec_Zscale_1275,Descriptor_PCM_ProDec_Zscale_1276,Descriptor_PCM_ProDec_Zscale_1277,Descriptor_PCM_ProDec_Zscale_1278
AR_LIGANDS_0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.09,2.23,-5.36,0.3,-2.69,-2.53,-1.29,1.96,-1.63,0.57
AR_LIGANDS_1,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,-2.49,...,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00
AR_LIGANDS_2,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.09,2.23,-5.36,0.3,-2.69,-2.53,-1.29,1.96,-1.63,0.57
AR_LIGANDS_3,-2.49,-0.27,-0.41,-1.22,0.88,2.23,3.22,1.45,0.84,3.22,...,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00
AR_LIGANDS_4,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.09,2.23,-5.36,0.3,-2.69,-2.53,-1.29,1.96,-1.63,0.57
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AR_LIGANDS_12435,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.09,2.23,-5.36,0.3,-2.69,-2.53,-1.29,1.96,-1.63,0.57
AR_LIGANDS_12436,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,-2.49,...,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00
AR_LIGANDS_12437,-2.49,-0.27,-0.41,-1.22,0.88,2.23,3.22,1.45,0.84,3.22,...,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00
AR_LIGANDS_12438,-2.49,-0.27,-0.41,-1.22,0.88,2.23,3.22,1.45,0.84,3.22,...,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00
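
The shape of this matrix becomes clearer with a small sketch of alignment-based featurisation. It assumes that each aligned position contributes three Z-scale values and that gap positions contribute zeros; both assumptions are inferred from the output above (under them, 1278 columns would correspond to 426 aligned positions) and are not taken from the ProDec source. The three residue entries are read off the first columns of the `P0DMS8` rows, whose sequence starts with `MPNN`, and the toy alignment covers only the first four aligned positions.

```python
GAP = "-"

# Z-scale triples for three residues, as they appear in the matrix above.
ZSCALES = {
    "M": (-2.49, -0.27, -0.41),
    "P": (-1.22, 0.88, 2.23),
    "N": (3.22, 1.45, 0.84),
}


def zscale_features(aligned_seq):
    """Concatenate per-position Z-scales; gaps are encoded as zeros (assumption)."""
    features = []
    for residue in aligned_seq:
        if residue == GAP:
            features.extend((0.0, 0.0, 0.0))
        else:
            features.extend(ZSCALES[residue])
    return features


# Toy alignment consistent with the first ten columns of the matrix above.
alignment = {"P0DMS8": "MPNN", "P30542": "---M"}
features = {acc: zscale_features(seq) for acc, seq in alignment.items()}
for acc, row in features.items():
    print(acc, row[:10])
```

Because every ligand measured on the same protein receives the same protein descriptor row, rows such as `AR_LIGANDS_3` and `AR_LIGANDS_12437` are identical in this part of the matrix.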


We can of course combine this with molecular descriptors as well:

In [10]:
from qsprpred.data.utils.descriptorsets import RDKitDescs
from qsprpred.data.utils.descriptorcalculator import MoleculeDescriptorsCalculator

ds.nJobs = 12 # speeding things up a little
calc = MoleculeDescriptorsCalculator(desc_sets=[RDKitDescs()])
ds.addDescriptors(calc)
ds.getDescriptors()

Parallel apply in progress for AR_LIGANDS.:   0%|          | 0/21 [00:00<?, ?it/s]

QSPRID,Descriptor_PCM_ProDec_Zscale_1,Descriptor_PCM_ProDec_Zscale_2,Descriptor_PCM_ProDec_Zscale_3,Descriptor_PCM_ProDec_Zscale_4,Descriptor_PCM_ProDec_Zscale_5,Descriptor_PCM_ProDec_Zscale_6,Descriptor_PCM_ProDec_Zscale_7,Descriptor_PCM_ProDec_Zscale_8,Descriptor_PCM_ProDec_Zscale_9,Descriptor_PCM_ProDec_Zscale_10,...,Descriptor_RDkit_fr_sulfide,Descriptor_RDkit_fr_sulfonamd,Descriptor_RDkit_fr_sulfone,Descriptor_RDkit_fr_term_acetylene,Descriptor_RDkit_fr_tetrazole,Descriptor_RDkit_fr_thiazole,Descriptor_RDkit_fr_thiocyan,Descriptor_RDkit_fr_thiophene,Descriptor_RDkit_fr_unbrch_alkane,Descriptor_RDkit_fr_urea
AR_LIGANDS_0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AR_LIGANDS_1,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,-2.49,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AR_LIGANDS_2,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
AR_LIGANDS_3,-2.49,-0.27,-0.41,-1.22,0.88,2.23,3.22,1.45,0.84,3.22,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AR_LIGANDS_4,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AR_LIGANDS_12435,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AR_LIGANDS_12436,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,-2.49,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AR_LIGANDS_12437,-2.49,-0.27,-0.41,-1.22,0.88,2.23,3.22,1.45,0.84,3.22,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AR_LIGANDS_12438,-2.49,-0.27,-0.41,-1.22,0.88,2.23,3.22,1.45,0.84,3.22,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


## Adding Precalculated Descriptors

It is also possible to add custom precalculated descriptors. We just need to use the appropriate method and calculator. Let's consider a hypothetical example of adding the `Year` column as a descriptor:

In [11]:
from qsprpred.data.utils.descriptorcalculator import CustomDescriptorsCalculator
from qsprpred.data.utils.descriptorsets import DataFrameDescriptorSet

calc = CustomDescriptorsCalculator(desc_sets=[
    DataFrameDescriptorSet(ds.getDF()[["Year"]]) # just an example, not useful as a real descriptor
])
ds.addCustomDescriptors(calc)
ds.getDescriptors()

QSPRID,Descriptor_PCM_ProDec_Zscale_1,Descriptor_PCM_ProDec_Zscale_2,Descriptor_PCM_ProDec_Zscale_3,Descriptor_PCM_ProDec_Zscale_4,Descriptor_PCM_ProDec_Zscale_5,Descriptor_PCM_ProDec_Zscale_6,Descriptor_PCM_ProDec_Zscale_7,Descriptor_PCM_ProDec_Zscale_8,Descriptor_PCM_ProDec_Zscale_9,Descriptor_PCM_ProDec_Zscale_10,...,Descriptor_RDkit_fr_sulfonamd,Descriptor_RDkit_fr_sulfone,Descriptor_RDkit_fr_term_acetylene,Descriptor_RDkit_fr_tetrazole,Descriptor_RDkit_fr_thiazole,Descriptor_RDkit_fr_thiocyan,Descriptor_RDkit_fr_thiophene,Descriptor_RDkit_fr_unbrch_alkane,Descriptor_RDkit_fr_urea,Descriptor_Custom_DataFrame_Year
AR_LIGANDS_0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2008.0
AR_LIGANDS_1,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,-2.49,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2008.0
AR_LIGANDS_2,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2010.0
AR_LIGANDS_3,-2.49,-0.27,-0.41,-1.22,0.88,2.23,3.22,1.45,0.84,3.22,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2009.0
AR_LIGANDS_4,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2009.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AR_LIGANDS_12435,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2005.0
AR_LIGANDS_12436,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,-2.49,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2005.0
AR_LIGANDS_12437,-2.49,-0.27,-0.41,-1.22,0.88,2.23,3.22,1.45,0.84,3.22,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2001.0
AR_LIGANDS_12438,-2.49,-0.27,-0.41,-1.22,0.88,2.23,3.22,1.45,0.84,3.22,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2002.0


The only condition on the data frame passed to the `DataFrameDescriptorSet` is that it has the same index as the underlying data frame wrapped in the `MoleculeTable` or `QSPRDataset` we want to add the descriptors to.
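
A quick sanity check before adding custom descriptors is to compare the two indices. The sketch below does this on plain lists of identifiers. `index_mismatch` is a hypothetical helper, and it only checks that the same identifiers are present; whether QSPRpred also requires the same row order is not stated here, so treat that as an open question. The `Year` values are the ones displayed in the table above.

```python
def index_mismatch(dataset_ids, descriptor_ids):
    """Return (ids missing from the descriptors, ids unknown to the data set)."""
    dataset_set, descriptor_set = set(dataset_ids), set(descriptor_ids)
    missing = sorted(dataset_set - descriptor_set)
    extra = sorted(descriptor_set - dataset_set)
    return missing, extra


dataset_ids = ["AR_LIGANDS_0", "AR_LIGANDS_1", "AR_LIGANDS_2"]

# Custom descriptor values keyed by the same identifiers.
year = {"AR_LIGANDS_0": 2008.0, "AR_LIGANDS_1": 2008.0, "AR_LIGANDS_2": 2010.0}
ok = index_mismatch(dataset_ids, year)

# A table with one row missing and one unknown (made-up) identifier.
broken = {"AR_LIGANDS_0": 2008.0, "AR_LIGANDS_1": 2008.0, "OTHER_7": 1999.0}
bad = index_mismatch(dataset_ids, broken)

print(ok)
print(bad)
```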

**WARNING:** Adding custom descriptors is tricky when you want to use the trained models for prediction, because the model will not know how to calculate these descriptors for unknown compounds. The use of this functionality should therefore be limited to data set preparation or model evaluation on known sets only.