# PDBe API Training

### PDBe Macromolecular Interactions for a given protein

This tutorial will guide you through searching PDBe for macromolecular interactions programmatically.

## Setup

First, we import the code required to query the API and process the results.

Run the cell below by pressing the play button.

In [None]:
import pandas as pd
from pprint import pprint
import sys
sys.path.insert(0,'..')

# Import two functions already defined in previous tutorials
from tutorial_utilities.api_modules import get_url, explode_dataset
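
`get_url` comes from the tutorial utilities and its source is not shown in this notebook. As a rough stand-in, assuming it issues a GET request and returns the parsed JSON body (or `None` on failure), a standard-library version might look like the sketch below. `fetch_json` is a hypothetical name, and the real helper may behave differently.

```python
import json
import urllib.error
import urllib.request


def fetch_json(url, timeout=30):
    """Hypothetical stand-in for get_url: return parsed JSON, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, ValueError) as exc:
        # URLError covers network and HTTP errors; ValueError covers
        # malformed URLs and bodies that are not valid JSON.
        print(f"Request failed for {url}: {exc}")
        return None
```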

---
---

## Obtaining data

Now we are ready to find all the macromolecular interactions that a protein in the PDB archive forms.

We will retrieve the macromolecular interactions of human acetylcholinesterase, which has the UniProt accession P22303.

In [None]:
BASE_URL = "https://www.ebi.ac.uk/pdbe/"
PDBEKB_UNIPROT_URL = BASE_URL + "graph-api/uniprot/"

def get_macromolecule_interaction_data(uniprot_accession):
    """
    Get all the macromolecule interaction data for a given UniProt accession
    """

    url = PDBEKB_UNIPROT_URL + "interface_residues/" + uniprot_accession
    print(url)

    data = get_url(url=url)
    data_to_ret = []

    for data_uniprot_accession in data:
        accession_data = data.get(data_uniprot_accession)
        length = accession_data['length']

        for row in accession_data['data']:
            interaction_accession = row['accession']
            all_pdb_entries = row['allPDBEntries']
            name = row['name']

            # Get the accession type; None if 'additionalData' or 'type' is not present
            accession_type = row.get('additionalData', {}).get('type')

            for residue in row.get('residues', []):
                residue['interaction_accession'] = interaction_accession
                residue['interaction_name'] = name
                residue['length'] = length
                residue['uniprot_accession'] = uniprot_accession
                residue['interaction_accession_type'] = accession_type

                # Get the interacting PDB entries, return empty list if not present
                interacting_entries = residue.get('interactingPDBEntries', [])
                residue['interacting_pdb_entries'] = interacting_entries
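                # Fraction of this partner's PDB entries in which the residue interacts
                # (0.0 if the partner has no PDB entries listed, to avoid dividing by zero)
                residue['interaction_ratio'] = (
                    len(interacting_entries) / len(all_pdb_entries)
                    if all_pdb_entries else 0.0
                )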
                residue['allPDBEntries'] = all_pdb_entries
                data_to_ret.append(residue)

    return data_to_ret

In [None]:
interaction_data = get_macromolecule_interaction_data(uniprot_accession="P22303")
pprint(interaction_data[0])
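
To see what the function above expects without calling the API, the same parsing logic can be run on a hand-made payload. This is a minimal sketch: the key names are taken from the function above, but the accessions, entry IDs and residue positions are invented placeholders, and `flatten_payload` is a hypothetical helper that keeps only a few fields.

```python
# Hand-made payload mimicking the structure read by the parsing code.
# All values are invented placeholders, not real PDBe data.
mock_payload = {
    "P00000": {
        "length": 100,
        "data": [
            {
                "accession": "Q00000",
                "name": "Example partner",
                "allPDBEntries": ["1abc", "2def", "3ghi", "4jkl"],
                "additionalData": {"type": "UniProt"},
                "residues": [
                    {
                        "startIndex": 10,
                        "endIndex": 10,
                        "interactingPDBEntries": [{"pdbId": "1abc"}, {"pdbId": "2def"}],
                    },
                    {
                        "startIndex": 11,
                        "endIndex": 11,
                        "interactingPDBEntries": [{"pdbId": "1abc"}],
                    },
                ],
            }
        ],
    }
}


def flatten_payload(payload):
    """Return one row per residue with the fraction of entries it interacts in."""
    rows = []
    for accession_data in payload.values():
        for partner in accession_data["data"]:
            all_entries = partner["allPDBEntries"]
            for residue in partner.get("residues", []):
                interacting = residue.get("interactingPDBEntries", [])
                rows.append({
                    "partner": partner["accession"],
                    "residue_number": residue["startIndex"],
                    "interaction_ratio": len(interacting) / len(all_entries),
                })
    return rows


rows = flatten_payload(mock_payload)
for row in rows:
    print(row)
```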

---
---

## Reformatting the data

The query results contain all the information about the macromolecular interactions, but they are returned as a nested list that is difficult to analyse without reformatting.

The following code simplifies the data, flattening the nested format:

In [None]:
# Reformat data to make it a list of the macromolecular interactions found in the PDB archive for the protein
df_all_interactions = explode_dataset(
    result=interaction_data, 
    column_to_explode='interactingPDBEntries'
)

In [None]:
df_all_interactions.head()
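
For readers without the `tutorial_utilities` package to hand, the flattening step can be approximated with `DataFrame.explode`. This sketch assumes that `explode_dataset` builds a DataFrame and expands one list-valued column into one row per list element; the real helper may differ. `explode_sketch` is a hypothetical name and the records are invented.

```python
import pandas as pd

# Invented records: each residue carries a list of interacting PDB entries
records = [
    {"startIndex": 10, "interactingPDBEntries": [{"pdbId": "1abc"}, {"pdbId": "2def"}]},
    {"startIndex": 11, "interactingPDBEntries": [{"pdbId": "1abc"}]},
]


def explode_sketch(result, column_to_explode):
    """Expand a list-valued column so that each list element gets its own row."""
    df = pd.DataFrame(result)
    # reset_index gives every exploded row a unique label
    return df.explode(column_to_explode).reset_index(drop=True)


exploded = explode_sketch(records, "interactingPDBEntries")
print(exploded)
```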

---
---

## Exploring the data

The following code lists all the unique macromolecules (UniProt IDs) that interact with human acetylcholinesterase in the PDB archive:

**--The following filtering fulfils Project Aim 1B--**

In [None]:
df_all_interactions['interaction_accession'].unique()

Some post-processing is required to split `interactingPDBEntries` into separate columns. Here we convert that column from a semi-structured JSON format into a flat table:

In [None]:
data = pd.json_normalize(df_all_interactions['interactingPDBEntries'])

# Add the new columns to the dataframe
df_interactions = df_all_interactions.join(data)
# Remove the original nested column, which is no longer needed
df_interactions = df_interactions.drop(columns='interactingPDBEntries')

In [None]:
df_interactions.head()
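
The normalise-and-join step relies on the two tables sharing the same row labels. The sketch below shows this on invented rows: `pd.json_normalize` turns each dictionary into columns with a fresh 0..n-1 index, and `join` then lines these up with a DataFrame that has the same index. Only `pdbId` is taken from this tutorial; the `entityId` field is an assumption used for illustration.

```python
import pandas as pd

# Invented rows in the shape produced by the explode step
exploded = pd.DataFrame({
    "startIndex": [10, 10, 11],
    "interactingPDBEntries": [
        {"pdbId": "1abc", "entityId": 1},
        {"pdbId": "2def", "entityId": 2},
        {"pdbId": "1abc", "entityId": 1},
    ],
})

# One column per dictionary key, indexed 0..n-1
details = pd.json_normalize(exploded["interactingPDBEntries"].tolist())

# join aligns on the index, so both frames must share the same labels
flat = exploded.drop(columns="interactingPDBEntries").join(details)
print(flat)
```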

`startIndex` and `endIndex` both give the UniProt residue number, so we create a new column called `residue_number` and copy `startIndex` into it. We will also want to count the number of results, so we create a placeholder `count` column, filled with the `pdbId` values, to hold that count.

In [None]:
df_interactions['residue_number'] = df_interactions['startIndex']
df_interactions['count'] = df_interactions['pdbId']

Now we are ready to use the data.

In [None]:
df_interactions.head()
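
One possible use of the `count` column, sketched here on invented rows, is to group by `residue_number` and count the distinct PDB entries observed at each position. This illustrates the idea behind the placeholder column; it is an assumption about how the column is used, not a step taken from this notebook.

```python
import pandas as pd

# Invented rows: 'count' holds PDB IDs, as in the placeholder column above
df = pd.DataFrame({
    "residue_number": [10, 10, 10, 11],
    "count": ["1abc", "2def", "1abc", "1abc"],
})

# Number of distinct PDB entries in which each residue is at an interface
per_residue = df.groupby("residue_number")["count"].nunique().reset_index()
print(per_residue)
```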

You can now carry out an analysis similar to the one in `2_ligand_interactions.ipynb`. Investigate whether the drug-binding and macromolecular binding sites overlap.
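
As a starting point for that comparison, the overlap can be computed as a set intersection over residue numbers. The two lists below are invented placeholders standing in for the `residue_number` columns of the ligand and macromolecule DataFrames.

```python
# Invented residue numbers, standing in for the two residue_number columns
ligand_residues = [72, 74, 86, 124, 286]
macromolecule_residues = [74, 124, 286, 341, 342]

# Residues that appear in both binding sites
shared = sorted(set(ligand_residues) & set(macromolecule_residues))

# Fraction of ligand-binding residues that are also at a macromolecular interface
fraction_shared = len(shared) / len(set(ligand_residues))

print(shared)
print(f"{fraction_shared:.0%} of ligand-binding residues are shared")
```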