In [1]:
#@markdown Setup dependencies (Colab only)
%%capture
try:
    import papyrus_scripts
except ImportError:
    !pip uninstall papyrus-scripts -y
    !pip install rdkit-pypi
    !pip install https://github.com/OlivierBeq/Papyrus-scripts/tarball/master --no-cache-dir
    get_ipython().kernel.do_shutdown(True)

# Match PDB data with the Papyrus dataset

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/OlivierBeq/Papyrus-scripts/blob/master/notebook_examples/matchRCSB.ipynb)

This notebook gives an overview of the input parameters required to scrape the RCSB database and match its entries to ligands in the Papyrus dataset.

The output is a separate TSV file containing the matched entries.

It is also possible to import the modules and run this code snippet in your own workflow.

## Simulate data to match

Let's first download the Papyrus data.

In [2]:
from papyrus_scripts.download import download_papyrus

We will only consider the tabulated data and omit molecular structures as well as protein and molecular descriptors.

In [3]:
download_papyrus(version='latest', descriptors=None)

Latest version: 05.5
Number of files to be donwloaded: 6
Total size: 721MB


Donwloading version 05.5:   0%|          | 0.00/721M [00:00<?, ?B/s]

Let's now keep only the data associated with the human serotonin transporter (accession P31645).

In [4]:
from papyrus_scripts.reader import read_papyrus
from papyrus_scripts.preprocess import (keep_quality, keep_type, keep_accession,
                                        consume_chunks
                                       )

In [5]:
sample_data = read_papyrus(is3d=False, chunksize=1000000, source_path=None)
filter1_it = keep_accession(sample_data, 'P31645')
filter2_it = keep_quality(data=filter1_it, min_quality='medium')
filter3_it = keep_type(data=filter2_it, activity_types=['Ki', 'KD'])
filtered_data = consume_chunks(filter3_it, progress=True, total=60)

  0%|          | 0/60 [00:00<?, ?it/s]
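The filtering cell above chains several `keep_*` calls over chunked data and only materialises a table in `consume_chunks`. The sketch below illustrates that general pattern with plain pandas. It is not the Papyrus implementation: `keep_rows`, `consume`, the toy chunks, the column names `accession` and `type`, and the accession `P00000` are all made up for illustration.

```python
import pandas as pd


def keep_rows(chunks, column, allowed):
    """Lazily yield each chunk restricted to rows whose `column` is in `allowed`."""
    for chunk in chunks:
        yield chunk[chunk[column].isin(allowed)]


def consume(chunks):
    """Pull every chunk through the pipeline and concatenate what is left."""
    return pd.concat(list(chunks), ignore_index=True)


# Toy stand-in for a chunked reader: two small chunks of made-up rows.
toy_chunks = iter([
    pd.DataFrame({"accession": ["P31645", "P00000"],
                  "Quality": ["High", "High"],
                  "type": ["Ki", "Ki"]}),
    pd.DataFrame({"accession": ["P31645", "P31645"],
                  "Quality": ["Low", "Medium"],
                  "type": ["KD", "IC50"]}),
])

# Each step only wraps the previous iterator; no rows are read yet.
step1 = keep_rows(toy_chunks, "accession", ["P31645"])
step2 = keep_rows(step1, "Quality", ["High", "Medium"])
step3 = keep_rows(step2, "type", ["Ki", "KD"])

# The chunks are read and filtered only when the pipeline is consumed.
toy_filtered = consume(step3)
```

Processing one chunk at a time keeps memory usage bounded by the chunk size rather than by the size of the full dataset, which is presumably why the reader above is called with `chunksize`.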

## Matching the PDB

In [6]:
from papyrus_scripts.matchRCSB import get_matches

The following function call will:

1. update the local copy of PDB identifiers (if it is not recent enough), and
2. match the provided data with the PDB.


Two new columns are added to the input dataframe: *PDBID_ligand* and *PDBID_protein*.
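Conceptually, this kind of matching can be viewed as a chain of table joins: compound InChI to PDB ligand identifier, ligand identifier to the PDB entries that contain it, and PDB entry to UniProt accession. The sketch below shows that idea with pandas merges. It is an illustration only and makes no claim about how `get_matches` is implemented: the three lookup tables are hypothetical stand-ins for the RCSB mappings, and the InChI strings are placeholders.

```python
import pandas as pd

# Toy activity table; the InChI strings are placeholders, not real identifiers.
activities = pd.DataFrame({
    "InChI": ["InChI=<compound A>", "InChI=<compound B>"],
    "accession": ["P31645", "P31645"],
})

# Hypothetical lookup tables standing in for the RCSB mappings.
inchi_to_ligand = pd.DataFrame({
    "InChI": ["InChI=<compound A>"],
    "PDBID_ligand": ["8PR"],
})
ligand_to_protein = pd.DataFrame({
    "PDBID_ligand": ["8PR", "8PR"],
    "PDBID_protein": ["5i6x", "6awn"],
})
protein_to_accession = pd.DataFrame({
    "PDBID_protein": ["5i6x", "6awn"],
    "accession": ["P31645", "P31645"],
})

matches = (
    activities
    # compound -> PDB ligand ID (compounds without a PDB ligand drop out)
    .merge(inchi_to_ligand, on="InChI")
    # ligand -> PDB entries containing that ligand
    .merge(ligand_to_protein, on="PDBID_ligand")
    # keep only entries whose protein is the target of the activity
    .merge(protein_to_accession, on=["PDBID_protein", "accession"])
)
```

Because inner joins are used here, compound B, which has no PDB ligand in the toy mapping, does not appear in `matches`, and compound A appears once per structure.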

In [7]:
PDB_matches = get_matches(filtered_data, verbose=True)
PDB_matches

Obtaining RCSB compound mappings from InChI to PDB ID


Converting InChIs:   0%|          | 0/37103 [00:00<?, ?it/s]

Obtaining RCSB compound mappings from ligand PDB Id to protein PDB ID
Combining the data
Obtaining mappings from protein PDB ID to UniProt accessions
Combining the data
Writing results to disk


Unnamed: 0,Activity_ID,Quality,source,CID,SMILES,connectivity,InChIKey,InChI,InChI_AuxInfo,target_id,...,relation,pchembl_value,pchembl_value_Mean,pchembl_value_StdDev,pchembl_value_SEM,pchembl_value_N,pchembl_value_Median,pchembl_value_MAD,PDBID_ligand,PDBID_protein
0,AHOUBRCZNHFOSL_on_P31645_WT,High,ChEMBL30,CHEMBL490,Fc1ccc(C2CCNCC2COc2ccc3OCOc3c2)cc1,AHOUBRCZNHFOSL,AHOUBRCZNHFOSL-UHFFFAOYSA-N,InChI=1S/C19H20FNO3/c20-15-3-1-13(2-4-15)17-7-...,"""AuxInfo=1/0/N:4,23,3,24,15,16,7,8,22,10,12,19...",P31645_WT,...,=,10.1;10.4;10.05;9.38;9.42;9.96,9.885,0.403869,0.164879,6.0,10.005,0.363237,8PR,6dzw
1,AHOUBRCZNHFOSL_on_P31645_WT,High,ChEMBL30,CHEMBL490,Fc1ccc(C2CCNCC2COc2ccc3OCOc3c2)cc1,AHOUBRCZNHFOSL,AHOUBRCZNHFOSL-UHFFFAOYSA-N,InChI=1S/C19H20FNO3/c20-15-3-1-13(2-4-15)17-7-...,"""AuxInfo=1/0/N:4,23,3,24,15,16,7,8,22,10,12,19...",P31645_WT,...,=,10.1;10.4;10.05;9.38;9.42;9.96,9.885,0.403869,0.164879,6.0,10.005,0.363237,8PR,6awn
2,AHOUBRCZNHFOSL_on_P31645_WT,High,ChEMBL30,CHEMBL490,Fc1ccc(C2CCNCC2COc2ccc3OCOc3c2)cc1,AHOUBRCZNHFOSL,AHOUBRCZNHFOSL-UHFFFAOYSA-N,InChI=1S/C19H20FNO3/c20-15-3-1-13(2-4-15)17-7-...,"""AuxInfo=1/0/N:4,23,3,24,15,16,7,8,22,10,12,19...",P31645_WT,...,=,10.1;10.4;10.05;9.38;9.42;9.96,9.885,0.403869,0.164879,6.0,10.005,0.363237,8PR,6vrh
3,AHOUBRCZNHFOSL_on_P31645_WT,High,ChEMBL30,CHEMBL490,Fc1ccc(C2CCNCC2COc2ccc3OCOc3c2)cc1,AHOUBRCZNHFOSL,AHOUBRCZNHFOSL-UHFFFAOYSA-N,InChI=1S/C19H20FNO3/c20-15-3-1-13(2-4-15)17-7-...,"""AuxInfo=1/0/N:4,23,3,24,15,16,7,8,22,10,12,19...",P31645_WT,...,=,10.1;10.4;10.05;9.38;9.42;9.96,9.885,0.403869,0.164879,6.0,10.005,0.363237,8PR,5i6x
4,BCGWQEUPMDMJNV_on_P31645_WT,High,ChEMBL30,CHEMBL11,CN(C)CCCN1c2c(cccc2)CCc2c1cccc2,BCGWQEUPMDMJNV,BCGWQEUPMDMJNV-UHFFFAOYSA-N,InChI=1S/C19H24N2/c1-20(2)14-7-15-21-18-10-5-3...,"""AuxInfo=1/0/N:1,3,11,20,12,19,5,10,21,13,18,1...",P31645_WT,...,=,8.24;5.89;8.72;7.7;7.51;9.0;9.28;8.7;9.11;8.6;...,8.326667,0.929353,0.268281,12.0,8.64,0.563388,IXX,7lwd
5,SGEGOXDYSFKCPT_on_P31645_WT,High,ChEMBL30,CHEMBL439849,N#Cc1ccc2[nH]cc(CCCCN3CCN(c4ccc5oc(C(N)=O)cc5c...,SGEGOXDYSFKCPT,SGEGOXDYSFKCPT-UHFFFAOYSA-N,InChI=1S/C26H27N5O2/c27-16-18-4-6-23-22(13-18)...,"""AuxInfo=1/1/N:11,12,10,4,19,5,20,13,15,31,16,...",P31645_WT,...,=,8.72;9.3,9.01,0.410122,0.29,2.0,9.01,0.429954,YG7,7lwd
6,WSEQXVZVJXJVFP_on_P31645_WT,High,ChEMBL30,CHEMBL1508;CHEMBL1508;CHEMBL1508;CHEMBL549;CHE...,CN(C)CCCC1(c2ccc(F)cc2)OCc2cc(C#N)ccc21,WSEQXVZVJXJVFP,WSEQXVZVJXJVFP-UHFFFAOYSA-N,InChI=1S/C20H21FN2O/c1-23(2)11-3-10-20(17-5-7-...,"""AuxInfo=1/0/N:1,3,5,22,9,14,10,13,23,6,4,18,2...",P31645_WT,...,=,8.75;9.0;9.0;7.48;8.36;9.11;7.8;8.8;8.59;8.13;...,8.231667,1.105407,0.319103,12.0,8.625,0.555975,68P,5i73
7,WSEQXVZVJXJVFP_on_P31645_WT,High,ChEMBL30,CHEMBL1508;CHEMBL1508;CHEMBL1508;CHEMBL549;CHE...,CN(C)CCCC1(c2ccc(F)cc2)OCc2cc(C#N)ccc21,WSEQXVZVJXJVFP,WSEQXVZVJXJVFP-UHFFFAOYSA-N,InChI=1S/C20H21FN2O/c1-23(2)11-3-10-20(17-5-7-...,"""AuxInfo=1/0/N:1,3,5,22,9,14,10,13,23,6,4,18,2...",P31645_WT,...,=,8.75;9.0;9.0;7.48;8.36;9.11;7.8;8.8;8.59;8.13;...,8.231667,1.105407,0.319103,12.0,8.625,0.555975,68P,5i71
8,WSEQXVZVJXJVFP_on_P31645_WT,High,ChEMBL30,CHEMBL1508;CHEMBL1508;CHEMBL1508;CHEMBL549;CHE...,CN(C)CCCC1(c2ccc(F)cc2)OCc2cc(C#N)ccc21,WSEQXVZVJXJVFP,WSEQXVZVJXJVFP-UHFFFAOYSA-N,InChI=1S/C20H21FN2O/c1-23(2)11-3-10-20(17-5-7-...,"""AuxInfo=1/0/N:1,3,5,22,9,14,10,13,23,6,4,18,2...",P31645_WT,...,=,8.75;9.0;9.0;7.48;8.36;9.11;7.8;8.8;8.59;8.13;...,8.231667,1.105407,0.319103,12.0,8.625,0.555975,68P,5i75
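Because each row pairs one compound with one PDB entry, compounds crystallised in several structures are repeated. A possible way to condense such a result is sketched below, assuming only the three identifier columns shown in the output above; the small `shown` frame copies those values by hand so that the example runs without the full dataset.

```python
import pandas as pd

# Identifier columns copied from the rows displayed above.
shown = pd.DataFrame({
    "connectivity": (["AHOUBRCZNHFOSL"] * 4
                     + ["BCGWQEUPMDMJNV", "SGEGOXDYSFKCPT"]
                     + ["WSEQXVZVJXJVFP"] * 3),
    "PDBID_ligand": ["8PR"] * 4 + ["IXX", "YG7"] + ["68P"] * 3,
    "PDBID_protein": ["6dzw", "6awn", "6vrh", "5i6x",
                      "7lwd", "7lwd",
                      "5i73", "5i71", "5i75"],
})

# One row per compound, with its PDB entries gathered into a sorted list.
per_compound = (
    shown.drop_duplicates()
    .sort_values("PDBID_protein")
    .groupby(["connectivity", "PDBID_ligand"])["PDBID_protein"]
    .agg(list)
    .reset_index()
)

# PDB entries that contain more than one of the matched ligands.
ligands_per_entry = shown.groupby("PDBID_protein")["PDBID_ligand"].nunique()
shared_entries = ligands_per_entry[ligands_per_entry > 1].index.tolist()
```

With the real result, `shown` could be replaced by `PDB_matches`, provided it carries the same column names.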
