# Data Collection with Papyrus

QSPRpred provides a `Papyrus` wrapper class around the [Papyrus](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00672-x) data set and its [associated scripts](https://github.com/OlivierBeq/Papyrus-scripts).
The class downloads the data set and supports efficient filtering and querying for some common tasks. In this tutorial, we will show how to fetch bioactivity data for a particular set of protein targets.

### Fetching Bioactivity Data

We can use the following code to fetch bioactivity data for four adenosine receptor subtypes at once, identified by their accession keys: A2A (P29274), A2B (P29275), A1 (P30542), and A3 (P0DMS8).

In [1]:
from qsprpred.data.sources.papyrus import Papyrus

acc_keys = ["P29274", "P29275", "P30542", "P0DMS8"]
dataset_name = "PapyrusTutorialDataset"  # name of the file to be generated
quality = "high"  # choose minimum quality from {"high", "medium", "low"}
papyrus_version = "05.6"  # Papyrus database version
data_dir = "../../tutorial_data/papyrus"  # directory to store the Papyrus data
output_dir = "../../tutorial_output/data"  # directory to store the generated dataset

# Create a Papyrus object, which specifies the version and the directory to store the Papyrus data
papyrus = Papyrus(
    data_dir=data_dir,
    version=papyrus_version,
    stereo=False,
    plus_only=True,
)

# Create a subset of the Papyrus data for the given accession keys; returns a MoleculeTable
mt = papyrus.getData(
    dataset_name,
    acc_keys,
    quality,
    output_dir=output_dir,
    use_existing=True,
    activity_types=["Ki", "IC50", "Kd"]
)
mt.getDF().head()

########## DISCLAIMER ##########
You are downloading the high-quality Papyrus++ dataset.
Should you want to access the entire, though of lower quality, Papyrus dataset,
look into additional switches of this command.
################################
Number of files to be downloaded: 3
Total size: 33.0MB



| ID (index) | Activity_ID | Quality | source | CID | SMILES | connectivity | InChIKey | InChI | InChI_AuxInfo | target_id | ... | pchembl_value | pchembl_value_Mean | pchembl_value_StdDev | pchembl_value_SEM | pchembl_value_N | pchembl_value_Median | pchembl_value_MAD | original_smiles | ID | ID_before_change |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PapyrusTutorialDataset_storage_library_00000 | AACWUFIIMOHGSO_on_P29274_WT | High | ChEMBL31 | ChEMBL31.compound.91968 | `Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...` | AACWUFIIMOHGSO | AACWUFIIMOHGSO-UHFFFAOYSA-N | `InChI=1S/C19H24N6O2/c1-12-10-13(2)25(23-12)17-...` | `"AuxInfo=1/1/N:1,26,22,14,15,20,19,11,12,27,6,...` | P29274_WT | ... | 8.68 | 8.68 | 0.0 | 0.0 | 1.0 | 8.68 | 0.0 | `Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...` | PapyrusTutorialDataset_storage_library_00000 | PapyrusTutorialDataset_storage_library_00000 |
| PapyrusTutorialDataset_storage_library_00001 | AACWUFIIMOHGSO_on_P30542_WT | High | ChEMBL31 | ChEMBL31.compound.91968 | `Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...` | AACWUFIIMOHGSO | AACWUFIIMOHGSO-UHFFFAOYSA-N | `InChI=1S/C19H24N6O2/c1-12-10-13(2)25(23-12)17-...` | `"AuxInfo=1/1/N:1,26,22,14,15,20,19,11,12,27,6,...` | P30542_WT | ... | 6.68 | 6.68 | 0.0 | 0.0 | 1.0 | 6.68 | 0.0 | `Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...` | PapyrusTutorialDataset_storage_library_00001 | PapyrusTutorialDataset_storage_library_00001 |
| PapyrusTutorialDataset_storage_library_00002 | AAEYTMMNWWKSKZ_on_P29274_WT | High | ChEMBL31 | ChEMBL31.compound.131451 | `Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...` | AAEYTMMNWWKSKZ | AAEYTMMNWWKSKZ-UHFFFAOYSA-N | `InChI=1S/C18H16N4O3S/c19-15-13-9-10-3-1-2-4-14...` | `"AuxInfo=1/1/N:22,23,21,24,8,15,9,14,19,20,7,1...` | P29274_WT | ... | 4.82 | 4.82 | 0.0 | 0.0 | 1.0 | 4.82 | 0.0 | `Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...` | PapyrusTutorialDataset_storage_library_00002 | PapyrusTutorialDataset_storage_library_00002 |
| PapyrusTutorialDataset_storage_library_00003 | AAGFKZWKWAMJNP_on_P0DMS8_WT | High | ChEMBL31 | ChEMBL31.compound.100375 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | AAGFKZWKWAMJNP | AAGFKZWKWAMJNP-UHFFFAOYSA-N | `InChI=1S/C21H14N6O2/c28-20(14-8-3-1-4-9-14)24-...` | `"AuxInfo=1/1/N:27,19,26,28,18,20,9,25,29,17,21...` | P0DMS8_WT | ... | 7.15 | 7.15 | 0.0 | 0.0 | 1.0 | 7.15 | 0.0 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | PapyrusTutorialDataset_storage_library_00003 | PapyrusTutorialDataset_storage_library_00003 |
| PapyrusTutorialDataset_storage_library_00004 | AAGFKZWKWAMJNP_on_P29274_WT | High | ChEMBL31 | ChEMBL31.compound.100375 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | AAGFKZWKWAMJNP | AAGFKZWKWAMJNP-UHFFFAOYSA-N | `InChI=1S/C21H14N6O2/c28-20(14-8-3-1-4-9-14)24-...` | `"AuxInfo=1/1/N:27,19,26,28,18,20,9,25,29,17,21...` | P29274_WT | ... | 5.65 | 5.65 | 0.0 | 0.0 | 1.0 | 5.65 | 0.0 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | PapyrusTutorialDataset_storage_library_00004 | PapyrusTutorialDataset_storage_library_00004 |
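
Because several targets are fetched in one call, it can be useful to check how the rows are distributed over them. The sketch below tallies the `target_id` values copied from the five preview rows above, using only the standard library. It assumes that a `target_id` consists of the accession key followed by an underscore and a variant label (such as `WT`), which matches the preview but is not guaranteed for every entry; `accession_of` is a hypothetical helper, not part of QSPRpred.

```python
from collections import Counter

# target_id values copied from the five preview rows shown above
preview_target_ids = [
    "P29274_WT",
    "P30542_WT",
    "P29274_WT",
    "P0DMS8_WT",
    "P29274_WT",
]


def accession_of(target_id: str) -> str:
    """Return the part of a target_id before the first underscore.

    Assumption: target_id looks like '<accession>_<variant>'.
    """
    return target_id.split("_", 1)[0]


# Count how many preview rows belong to each accession key
counts = Counter(accession_of(t) for t in preview_target_ids)
for accession, n_rows in counts.most_common():
    print(accession, n_rows)
```

On the full table, the same tally could likely be obtained with pandas via `mt.getDF()["target_id"].value_counts()`.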


By default, `getData` returns a `MoleculeTable`. To turn it into a `QSPRDataset` for modelling, use the `fromMolTable` helper method. See the [data representation tutorial](data_representation.ipynb) for more details.


In [2]:
from qsprpred import TargetTasks
from qsprpred.data import QSPRDataset

target_props = [
    {"name": "pchembl_value_Median", "task": TargetTasks.SINGLECLASS, "th": [6.5]}]
ds = QSPRDataset.fromMolTable(mt, target_props=target_props)
ds.targetProperties

[TargetProperty(name=pchembl_value_Median, task=SINGLECLASS, th=[6.5])]
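
To get a feel for what the threshold of 6.5 means, the following sketch applies it by hand to the `pchembl_value_Median` values from the five preview rows. A pChEMBL value is the negative base-10 logarithm of a molar activity, so 6.5 corresponds to roughly 316 nM. The helper names are hypothetical, and the rule that values strictly above the threshold count as active is an assumption made for illustration; check the QSPRpred documentation for how `SINGLECLASS` thresholds are actually applied.

```python
import math


def pchembl_from_nanomolar(value_nm: float) -> float:
    """Convert an activity in nM to the -log10(molar) scale."""
    return 9.0 - math.log10(value_nm)


def is_active(pchembl: float, threshold: float = 6.5) -> bool:
    """Assumed rule: strictly above the threshold counts as active."""
    return pchembl > threshold


# Concentration (in nM) that corresponds to a pChEMBL value of 6.5
threshold_nm = 10 ** (9.0 - 6.5)
print(f"threshold of 6.5 is about {threshold_nm:.0f} nM")

# pchembl_value_Median values copied from the five preview rows
preview_values = [8.68, 6.68, 4.82, 7.15, 5.65]
labels = [is_active(v) for v in preview_values]
for value, label in zip(preview_values, labels):
    print(value, "active" if label else "inactive")
```

Under this assumed rule, three of the five preview rows (8.68, 6.68, and 7.15) fall in the active class.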