# PDBe API Training

### PDBe Macromolecular Interactions for a given protein

This tutorial will guide you through searching PDBe for macromolecular interactions programmatically.

## Setup

First we will import the code required to query the API and handle the results.

Run the cell below by pressing the play button.

In [15]:
import pandas as pd
from pprint import pprint
import sys
sys.path.insert(0,'..')

# Import two functions already defined in previous tutorials
from tutorial_utilities.api_modules import get_url, explode_dataset

---
---

## Obtaining data

Now we are ready to find all the macromolecular interactions that a protein in the PDB archive forms.

We will get the macromolecular interactions of human acetylcholinesterase, which has the UniProt accession P22303.

In [16]:
BASE_URL = "https://www.ebi.ac.uk/pdbe/"
PDBEKB_UNIPROT_URL = BASE_URL + "graph-api/uniprot/"

def get_macromolecule_interaction_data(uniprot_accession):
    """
    Get all the macromolecule interaction data for a given UniProt accession
    """

    url = PDBEKB_UNIPROT_URL + "interface_residues/" + uniprot_accession
    print(url)

    data = get_url(url=url)
    data_to_ret = []

    for data_uniprot_accession in data:
        accession_data = data.get(data_uniprot_accession)
        length = accession_data['length']

        for row in accession_data['data']:
            interaction_accession = row['accession']
            all_pdb_entries = row['allPDBEntries']
            name = row['name']

            # Get the accession type; None if 'additionalData' or 'type' is missing
            accession_type = row.get('additionalData', {}).get('type')

            for residue in row.get('residues', []):
                residue['interaction_accession'] = interaction_accession
                residue['interaction_name'] = name
                residue['length'] = length
                residue['uniprot_accession'] = uniprot_accession
                residue['interaction_accession_type'] = accession_type

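                # Get the interacting PDB entries, return empty list if not present
                interacting_entries = residue.get('interactingPDBEntries', [])
                residue['interacting_pdb_entries'] = interacting_entries
                # Fraction of the partner's PDB entries in which this residue is at the interface;
                # guard against an empty entry list to avoid a ZeroDivisionError
                residue['interaction_ratio'] = (
                    len(interacting_entries) / len(all_pdb_entries)
                    if all_pdb_entries else 0.0
                )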
                residue['allPDBEntries'] = all_pdb_entries
                data_to_ret.append(residue)

    return data_to_ret

In [17]:
interaction_data = get_macromolecule_interaction_data(uniprot_accession="P22303")
pprint(interaction_data[0])

https://www.ebi.ac.uk/pdbe/graph-api/uniprot/interface_residues/P22303
{'allPDBEntries': ['6o5r',
                   '8dt4',
                   '6o5s',
                   '1vzj',
                   '6cqx',
                   '8aen',
                   '5hf6',
                   '6ntn',
                   '6o4w',
                   '6ntm',
                   '4ey6',
                   '6u3p',
                   '7e3d',
                   '5hf5',
                   '6wvq',
                   '7p1n',
                   '8dt5',
                   '6cqw',
                   '4m0e',
                   '6cqz',
                   '4ey4',
                   '5hf8',
                   '6o50',
                   '6wvc',
                   '3lii',
                   '7d9p',
                   '6ntg',
                   '7e3h',
                   '4ey5',
                   '5fpq',
                   '6u37',
                   '6cqu',
                   '6o66',
                   '6wuz',
           
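The flattening loop can also be exercised offline, without calling the API. The sketch below is a minimal stand-alone variant that takes an already-parsed payload; the function name `flatten_interface_residues`, the partner accession `TOY0001`, and all values in `toy_payload` are illustrative assumptions, and the payload shape is inferred only from the keys the function above reads. It also shows how `interaction_ratio` is derived: a residue found in 2 of a partner's 71 entries would score 2/71 ≈ 0.028169, which would be consistent with the ratio shown in the tables further down.

```python
def flatten_interface_residues(payload, uniprot_accession):
    """Offline variant of the loop above: one flat record per interface residue."""
    rows = []
    for accession_data in payload.values():
        length = accession_data["length"]
        for partner in accession_data["data"]:
            all_entries = partner["allPDBEntries"]
            # Chained .get() so a missing 'additionalData' gives None, not a KeyError
            partner_type = partner.get("additionalData", {}).get("type")
            for residue in partner.get("residues", []):
                interacting = residue.get("interactingPDBEntries", [])
                rows.append({
                    "residue_number": residue["startIndex"],
                    "interaction_accession": partner["accession"],
                    "interaction_accession_type": partner_type,
                    "length": length,
                    "uniprot_accession": uniprot_accession,
                    # Share of the partner's entries in which this residue is at the interface
                    "interaction_ratio": (
                        len(interacting) / len(all_entries) if all_entries else 0.0
                    ),
                })
    return rows


# Toy payload: illustrative values only, NOT real API output
toy_payload = {
    "P22303": {
        "length": 614,
        "data": [
            {
                "accession": "Q9Y215",
                "name": "toy partner",
                "allPDBEntries": ["1vzj"],
                "additionalData": {"type": "UNP"},
                "residues": [
                    {"startIndex": 595, "endIndex": 595,
                     "interactingPDBEntries": [
                         {"pdbId": "1vzj", "entityId": 2, "chainIds": "J,I"}]},
                ],
            },
            {
                "accession": "TOY0001",  # hypothetical partner without 'additionalData'
                "name": "toy partner without additionalData",
                "allPDBEntries": ["aaaa", "bbbb", "cccc", "dddd"],
                "residues": [
                    {"startIndex": 10, "endIndex": 10,
                     "interactingPDBEntries": [
                         {"pdbId": "aaaa", "entityId": 1, "chainIds": "A"}]},
                ],
            },
        ],
    }
}

rows = flatten_interface_residues(toy_payload, "P22303")
for row in rows:
    print(row["interaction_accession"], row["residue_number"], row["interaction_ratio"])
```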

---
---

## Reformatting the data

The query results contain all the information about the macromolecular interactions, but they are stored in a complex nested structure that is difficult to parse without reformatting.

The following code simplifies the data, flattening the nested format:

In [18]:
# Explode the data so that each row holds one interacting PDB entry for one interface residue
df_exploded = explode_dataset(
    result=interaction_data, 
    column_to_explode='interactingPDBEntries'
)

In [25]:
df_exploded.head()

Unnamed: 0,startIndex,endIndex,startCode,endCode,indexType,interactingPDBEntries,allPDBEntries,interaction_accession,interaction_name,length,uniprot_accession,interaction_accession_type,interacting_pdb_entries,interaction_ratio
0,56,56,PRO,PRO,UNIPROT,"{'pdbId': '7p1n', 'entityId': 2, 'chainIds': '...","[6o5r, 8dt4, 6o5s, 1vzj, 6cqx, 8aen, 5hf6, 6nt...",P22303,Acetylcholinesterase,614,P22303,UNP,"[{'pdbId': '7p1n', 'entityId': 2, 'chainIds': ...",0.028169
1,56,56,PRO,PRO,UNIPROT,"{'pdbId': '7p1p', 'entityId': 2, 'chainIds': '...","[6o5r, 8dt4, 6o5s, 1vzj, 6cqx, 8aen, 5hf6, 6nt...",P22303,Acetylcholinesterase,614,P22303,UNP,"[{'pdbId': '7p1n', 'entityId': 2, 'chainIds': ...",0.028169
2,74,74,GLY,GLY,UNIPROT,"{'pdbId': '7p1n', 'entityId': 2, 'chainIds': '...","[6o5r, 8dt4, 6o5s, 1vzj, 6cqx, 8aen, 5hf6, 6nt...",P22303,Acetylcholinesterase,614,P22303,UNP,"[{'pdbId': '7p1n', 'entityId': 2, 'chainIds': ...",0.028169
3,74,74,GLY,GLY,UNIPROT,"{'pdbId': '7p1p', 'entityId': 2, 'chainIds': '...","[6o5r, 8dt4, 6o5s, 1vzj, 6cqx, 8aen, 5hf6, 6nt...",P22303,Acetylcholinesterase,614,P22303,UNP,"[{'pdbId': '7p1n', 'entityId': 2, 'chainIds': ...",0.028169
4,75,75,PRO,PRO,UNIPROT,"{'pdbId': '7p1n', 'entityId': 2, 'chainIds': '...","[6o5r, 8dt4, 6o5s, 1vzj, 6cqx, 8aen, 5hf6, 6nt...",P22303,Acetylcholinesterase,614,P22303,UNP,"[{'pdbId': '7p1n', 'entityId': 2, 'chainIds': ...",0.028169

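The source of `explode_dataset` is not shown in this notebook, so the following is only a sketch of the row-per-entry idea, assuming the helper behaves much like pandas `DataFrame.explode`. The function name `explode_rows` is hypothetical, and the toy records reuse a few values from the output above purely for illustration.

```python
def explode_rows(rows, column):
    """Return one output row per element of the list stored under `column`."""
    exploded = []
    for row in rows:
        # An empty or missing list still yields one row, with None in the column
        values = row.get(column) or [None]
        for value in values:
            new_row = dict(row)  # shallow copy so the input rows are left untouched
            new_row[column] = value
            exploded.append(new_row)
    return exploded


# Toy records shaped like the residue dictionaries built earlier (illustrative only)
toy_residues = [
    {"startIndex": 56, "interactingPDBEntries": [
        {"pdbId": "7p1n", "entityId": 2, "chainIds": "B,A"},
        {"pdbId": "7p1p", "entityId": 2, "chainIds": "B,A"},
    ]},
    {"startIndex": 595, "interactingPDBEntries": [
        {"pdbId": "1vzj", "entityId": 2, "chainIds": "J,I"},
    ]},
]

exploded = explode_rows(toy_residues, "interactingPDBEntries")
for row in exploded:
    print(row["startIndex"], row["interactingPDBEntries"]["pdbId"])
```

Every other field is repeated on each exploded row, which is why values such as `allPDBEntries` and `interaction_ratio` appear unchanged across consecutive rows for the same residue.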

---
---

## Exploring the data

The following code lists the unique macromolecules (by UniProt accession) that interact with human acetylcholinesterase in the PDB archive:

**--The following filtering fulfils Project Aim 1B--**

In [20]:
df_exploded['interaction_accession'].unique()

array(['P22303', 'P0C1Z0', 'Q9Y215'], dtype=object)

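Beyond listing the partners, it can help to see how many rows each one contributes. The sketch below uses plain Python on toy rows (illustrative values only) to mirror what `.unique()` and pandas `Series.value_counts()` report. P22303 itself appears among the partners, which presumably reflects interfaces between copies of the enzyme.

```python
from collections import Counter

# Toy rows shaped like the exploded records (values are illustrative only)
toy_rows = [
    {"interaction_accession": "P22303", "startIndex": 56},
    {"interaction_accession": "P22303", "startIndex": 74},
    {"interaction_accession": "P0C1Z0", "startIndex": 74},
    {"interaction_accession": "Q9Y215", "startIndex": 595},
    {"interaction_accession": "Q9Y215", "startIndex": 598},
]

# Unique partners in order of first appearance, as .unique() would return them
unique_partners = list(dict.fromkeys(row["interaction_accession"] for row in toy_rows))

# Number of rows per partner, comparable to .value_counts()
rows_per_partner = Counter(row["interaction_accession"] for row in toy_rows)

print(unique_partners)
print(dict(rows_per_partner))
```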
Some post-processing is required to split interactingPDBEntries into separate columns. Here we convert the interactingPDBEntries column from semi-structured JSON into a flat table:

In [27]:
data = pd.json_normalize(df_exploded['interactingPDBEntries'])

# Add the new columns to the dataframe
df_interactions = df_exploded.join(data)
# Drop the original nested column, which is no longer needed
df_interactions = df_interactions.drop(columns='interactingPDBEntries')

In [22]:
df_interactions.head()

Unnamed: 0,startIndex,endIndex,startCode,endCode,indexType,allPDBEntries,interaction_accession,interaction_name,length,uniprot_accession,interaction_accession_type,interacting_pdb_entries,interaction_ratio,pdbId,entityId,chainIds
0,56,56,PRO,PRO,UNIPROT,"[6o5r, 8dt4, 6o5s, 1vzj, 6cqx, 8aen, 5hf6, 6nt...",P22303,Acetylcholinesterase,614,P22303,UNP,"[{'pdbId': '7p1n', 'entityId': 2, 'chainIds': ...",0.028169,7p1n,2,"B,A"
1,56,56,PRO,PRO,UNIPROT,"[6o5r, 8dt4, 6o5s, 1vzj, 6cqx, 8aen, 5hf6, 6nt...",P22303,Acetylcholinesterase,614,P22303,UNP,"[{'pdbId': '7p1n', 'entityId': 2, 'chainIds': ...",0.028169,7p1p,2,"B,A"
2,74,74,GLY,GLY,UNIPROT,"[6o5r, 8dt4, 6o5s, 1vzj, 6cqx, 8aen, 5hf6, 6nt...",P22303,Acetylcholinesterase,614,P22303,UNP,"[{'pdbId': '7p1n', 'entityId': 2, 'chainIds': ...",0.028169,7p1n,2,"B,A"
3,74,74,GLY,GLY,UNIPROT,"[6o5r, 8dt4, 6o5s, 1vzj, 6cqx, 8aen, 5hf6, 6nt...",P22303,Acetylcholinesterase,614,P22303,UNP,"[{'pdbId': '7p1n', 'entityId': 2, 'chainIds': ...",0.028169,7p1p,2,"B,A"
4,75,75,PRO,PRO,UNIPROT,"[6o5r, 8dt4, 6o5s, 1vzj, 6cqx, 8aen, 5hf6, 6nt...",P22303,Acetylcholinesterase,614,P22303,UNP,"[{'pdbId': '7p1n', 'entityId': 2, 'chainIds': ...",0.028169,7p1n,2,"B,A"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1938,595,595,MET,MET,UNIPROT,[1vzj],Q9Y215,Acetylcholinesterase collagenic tail peptide,614,P22303,UNP,"[{'pdbId': '1vzj', 'entityId': 2, 'chainIds': ...",1.000000,1vzj,2,"J,I"
1939,598,598,TRP,TRP,UNIPROT,[1vzj],Q9Y215,Acetylcholinesterase collagenic tail peptide,614,P22303,UNP,"[{'pdbId': '1vzj', 'entityId': 2, 'chainIds': ...",1.000000,1vzj,2,"J,I"
1940,601,601,GLN,GLN,UNIPROT,[1vzj],Q9Y215,Acetylcholinesterase collagenic tail peptide,614,P22303,UNP,"[{'pdbId': '1vzj', 'entityId': 2, 'chainIds': ...",1.000000,1vzj,2,"J,I"
1941,602,602,PHE,PHE,UNIPROT,[1vzj],Q9Y215,Acetylcholinesterase collagenic tail peptide,614,P22303,UNP,"[{'pdbId': '1vzj', 'entityId': 2, 'chainIds': ...",1.000000,1vzj,2,"J,I"

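To make the flattening step concrete, here is a plain-Python sketch of promoting the nested dictionary to top-level keys, which is roughly what `pd.json_normalize` followed by `join` achieves. The helper name `flatten_entry` and the extra `chain_list` key are illustrative assumptions. One point to keep in mind with the pandas version: `pd.json_normalize` returns a fresh 0..n-1 index and `join` aligns on the index, so the approach relies on `df_exploded` having a default index too.

```python
def flatten_entry(row, column="interactingPDBEntries"):
    """Return a copy of `row` with the dict under `column` promoted to top-level keys."""
    flat = {key: value for key, value in row.items() if key != column}
    flat.update(row.get(column) or {})
    return flat


# Toy rows reusing values from the first two rows of the output above
toy_rows = [
    {"startIndex": 56,
     "interactingPDBEntries": {"pdbId": "7p1n", "entityId": 2, "chainIds": "B,A"}},
    {"startIndex": 56,
     "interactingPDBEntries": {"pdbId": "7p1p", "entityId": 2, "chainIds": "B,A"}},
]

flat_rows = [flatten_entry(row) for row in toy_rows]

# chainIds is a comma-separated string; split it if individual chains are needed
for row in flat_rows:
    row["chain_list"] = row["chainIds"].split(",")

print(flat_rows[0])
```

Splitting `chainIds` is optional and only matters if a later analysis needs to treat each chain separately.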

startIndex and endIndex are UniProt residue numbers, so we'll make a new column called residue_number and copy startIndex into it. We are also going to count the number of results, so we'll add a placeholder count column (a copy of pdbId) to count on.

In [23]:
df_interactions['residue_number'] = df_interactions['startIndex']
df_interactions['count'] = df_interactions['pdbId']

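One way the `count` column could be used is to count, per residue, how many distinct PDB entries show that residue at an interface. In pandas this would be something like `df_interactions.groupby('residue_number')['count'].nunique()`; the sketch below shows the same idea in plain Python, using only the five rows visible in the output above (so the counts are illustrative, not the full result).

```python
from collections import defaultdict

# The five rows shown in the head() output above, reduced to the two relevant fields
toy_rows = [
    {"residue_number": 56, "pdbId": "7p1n"},
    {"residue_number": 56, "pdbId": "7p1p"},
    {"residue_number": 74, "pdbId": "7p1n"},
    {"residue_number": 74, "pdbId": "7p1p"},
    {"residue_number": 75, "pdbId": "7p1n"},
]

# Collect the distinct PDB entries seen for each residue
entries_per_residue = defaultdict(set)
for row in toy_rows:
    entries_per_residue[row["residue_number"]].add(row["pdbId"])

# Number of distinct entries per residue, ordered by residue number
counts = {
    residue: len(pdb_ids)
    for residue, pdb_ids in sorted(entries_per_residue.items())
}
print(counts)  # {56: 2, 74: 2, 75: 1}
```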
Now we are ready to use the data.

In [24]:
df_interactions.head()
# TODO Add more at end

Unnamed: 0,startIndex,endIndex,startCode,endCode,indexType,allPDBEntries,interaction_accession,interaction_name,length,uniprot_accession,interaction_accession_type,interacting_pdb_entries,interaction_ratio,pdbId,entityId,chainIds,residue_number,count
0,56,56,PRO,PRO,UNIPROT,"[6o5r, 8dt4, 6o5s, 1vzj, 6cqx, 8aen, 5hf6, 6nt...",P22303,Acetylcholinesterase,614,P22303,UNP,"[{'pdbId': '7p1n', 'entityId': 2, 'chainIds': ...",0.028169,7p1n,2,"B,A",56,7p1n
1,56,56,PRO,PRO,UNIPROT,"[6o5r, 8dt4, 6o5s, 1vzj, 6cqx, 8aen, 5hf6, 6nt...",P22303,Acetylcholinesterase,614,P22303,UNP,"[{'pdbId': '7p1n', 'entityId': 2, 'chainIds': ...",0.028169,7p1p,2,"B,A",56,7p1p
2,74,74,GLY,GLY,UNIPROT,"[6o5r, 8dt4, 6o5s, 1vzj, 6cqx, 8aen, 5hf6, 6nt...",P22303,Acetylcholinesterase,614,P22303,UNP,"[{'pdbId': '7p1n', 'entityId': 2, 'chainIds': ...",0.028169,7p1n,2,"B,A",74,7p1n
3,74,74,GLY,GLY,UNIPROT,"[6o5r, 8dt4, 6o5s, 1vzj, 6cqx, 8aen, 5hf6, 6nt...",P22303,Acetylcholinesterase,614,P22303,UNP,"[{'pdbId': '7p1n', 'entityId': 2, 'chainIds': ...",0.028169,7p1p,2,"B,A",74,7p1p
4,75,75,PRO,PRO,UNIPROT,"[6o5r, 8dt4, 6o5s, 1vzj, 6cqx, 8aen, 5hf6, 6nt...",P22303,Acetylcholinesterase,614,P22303,UNP,"[{'pdbId': '7p1n', 'entityId': 2, 'chainIds': ...",0.028169,7p1n,2,"B,A",75,7p1n


Now you can carry out an analysis similar to the one in 2_ligand_interactions.ipynb. Investigate whether the drug-binding and macromolecule-binding sites overlap.
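
As a starting point for the overlap question, the two binding sites can be compared as sets of UniProt residue numbers. The sketch below is a minimal illustration: the function name `site_overlap` and both residue sets are hypothetical placeholders, not results from the API. In practice the sets would come from the `residue_number` columns of the ligand and macromolecule DataFrames.

```python
def site_overlap(site_a, site_b):
    """Return the shared residues and the Jaccard index of two residue sets."""
    shared = site_a & site_b
    union = site_a | site_b
    jaccard = len(shared) / len(union) if union else 0.0
    return sorted(shared), jaccard


# Hypothetical residue numbers, for illustration only
ligand_site = {72, 74, 86, 286, 337}
macromolecule_site = {56, 74, 75, 286}

shared_residues, jaccard_index = site_overlap(ligand_site, macromolecule_site)
print("Shared residues:", shared_residues)
print("Jaccard index:", round(jaccard_index, 3))
```

A Jaccard index near 0 would suggest largely separate sites, while a value near 1 would suggest that ligands and macromolecules bind the same residues.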