# Data Collection with Papyrus

QSPRpred provides the `Papyrus` class, a wrapper around the [Papyrus](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00672-x) data set and its [associated scripts](https://github.com/OlivierBeq/Papyrus-scripts).
The class fetches the data set and supports efficient filtering and queries for some common tasks. In this tutorial, we show how to use it to fetch bioactivity data for a particular set of protein targets.

### Fetching Bioactivity Data

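We can use the following code to fetch bioactivity data for one or more protein targets, identified by their UniProt accession keys. The example below fetches data for hERG (Q12809). To fetch several targets at once, for example the adenosine receptor subtypes A2A (P29274), A2B (P29275), A1 (P30542) and A3 (P0DMS8), add their keys to the `acc_keys` list.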

In [1]:
from qsprpred.data.sources.papyrus import Papyrus

acc_keys = ["Q12809"]
dataset_name = "hERG_Dataset"  # name of the file to be generated
quality = "high"  # choose minimum quality from {"high", "medium", "low"}
papyrus_version = "05.6"  # Papyrus database version
data_dir = "../../tutorial_data"  # directory to store the Papyrus data
output_dir = "../../tutorial_output/data"  # directory to store the generated dataset

# Create a Papyrus object, which specifies the version and the directory to store the Papyrus data
papyrus = Papyrus(
    data_dir=data_dir,
    version=papyrus_version,
    stereo=False,
    plus_only=True,
)

# Create a subset of the Papyrus data for the given accession keys; returns a MoleculeTable
mt = papyrus.getData(
    dataset_name,
    acc_keys,
    quality,
    output_dir=output_dir,
    use_existing=True,
    activity_types=["Ki", "IC50", "Kd"]
)
mt.getDF().head()

OSError: Papyrus data not available (did you download it first?)
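
The accession keys passed to `getData` are UniProt accessions, so a mistyped key is easy to miss. The following is a minimal sketch of a sanity check that could run before the query, based on the accession format documented by UniProt. The helper name `invalid_accessions` is hypothetical and not part of QSPRpred.

```python
import re

# Accession format as documented by UniProt (6- or 10-character identifiers).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def invalid_accessions(keys):
    """Return the keys that do not look like UniProt accessions (hypothetical helper)."""
    return [key for key in keys if not UNIPROT_ACCESSION.fullmatch(key)]


# hERG plus the four adenosine receptor subtypes mentioned above
keys = ["Q12809", "P29274", "P29275", "P30542", "P0DMS8"]
bad = invalid_accessions(keys)
if bad:
    raise ValueError(f"Malformed accession keys: {bad}")
```

This only checks the shape of each key. It does not confirm that the accession exists in UniProt or that Papyrus contains data for it.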

By default, `getData` returns a `MoleculeTable`. To turn it into a `QSPRDataset` for modelling, use the `QSPRDataset.fromMolTable` helper method. See the [data representation tutorial](data_representation.ipynb) for more details.


In [7]:
from qsprpred import TargetTasks
from qsprpred.data import QSPRDataset

target_props = [
    {"name": "pchembl_value_Median", "task": TargetTasks.SINGLECLASS, "th": [6.5]}
]
ds = QSPRDataset.fromMolTable(mt, target_props=target_props)
ds.targetProperties

[TargetProperty(name=pchembl_value_Median, task=SINGLECLASS, th=[6.5])]
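
To make the `th` setting concrete: a single threshold splits the continuous pChEMBL values into two classes. The sketch below reproduces that idea in plain Python. It assumes that values strictly above the threshold are labelled active, which may differ from how QSPRpred treats values exactly at the boundary. The `binarize` helper is hypothetical and the pChEMBL values are made up for illustration.

```python
def binarize(values, threshold):
    """Label each value True (active) if it exceeds the threshold."""
    return [value > threshold for value in values]


# Made-up median pChEMBL values, for illustration only
pchembl_medians = [5.2, 6.5, 7.1, 8.0]
labels = binarize(pchembl_medians, threshold=6.5)
print(labels)
```

Since pChEMBL is the negative base-10 logarithm of the molar activity, a threshold of 6.5 corresponds to roughly 316 nM.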