# PDBe API Training

### PDBe Protein complexes for a given protein

This tutorial will guide you through searching PDBe programmatically to find the protein complexes formed by a given protein.

## Setup

First we will import the code which is required to search the API and reformat the results.

Run the cell below by pressing the play button.

In [None]:
import sys
sys.path.insert(0, '..')
from tutorial_utilities.api_modules import explode_dataset, get_url
import pandas as pd

---
---

## Obtaining data

Now we will find all the complexes containing the human Alpha-4 subunit of the Nicotinic Acetylcholine Receptor complex (UniProt accession: P43681).

In [None]:
BASE_URL = "https://www.ebi.ac.uk/pdbe/"
PDBEKB_UNIPROT_URL = BASE_URL + "graph-api/uniprot/"


def get_complexes_protein_data(accession):
    """
    Get all the protein complexes observed in PDB entries for a given UniProt accession.

    Returns a list of dictionaries, one per participant of each complex.
    """

    url = f"{PDBEKB_UNIPROT_URL}complex/{accession}"
    print(url)

    data = get_url(url=url)
    data_out = []

    for row in data[accession]:

        # The complex ID is the first key of the row
        complex_id = list(row.keys())[0]

        my_row = {"complex_id": complex_id}
        my_row.update(row[complex_id])

        # Example of a dictionary comprehension:
        # keep every complex-level field except 'participants'
        complex_fields = {k: v for k, v in my_row.items() if k != 'participants'}

        for item in my_row['participants']:
            # Merge the complex-level fields with the fields of each participant
            data_out.append({**complex_fields, **item})

    return data_out

In [None]:
uniprot_accession = 'P43681'
results = get_complexes_protein_data(uniprot_accession) 
results
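To see what the flattening step does without calling the API, the sketch below applies the same logic to a small hand-written payload. The payload structure, the complex ID `COMPLEX-A`, and the accessions `ACC1` and `ACC2` are illustrative assumptions based on how the function above reads its input. They are not real PDBe output.

```python
# Hand-written payload mimicking the structure the function expects
# (all IDs and values here are made up for illustration)
mock_data = {
    "ACC1": [
        {
            "COMPLEX-A": {
                "subcomplexes": [],
                "participants": [
                    {"accession": "ACC1", "stoichiometry": 2, "taxonomy_id": 9606},
                    {"accession": "ACC2", "stoichiometry": 3, "taxonomy_id": 9606},
                ],
            }
        }
    ]
}

flattened = []
for row in mock_data["ACC1"]:
    complex_id = list(row.keys())[0]
    details = row[complex_id]
    # Complex-level fields are repeated on every participant row
    complex_fields = {k: v for k, v in details.items() if k != "participants"}
    for participant in details["participants"]:
        flattened.append({"complex_id": complex_id, **complex_fields, **participant})

for record in flattened:
    print(record)
```

Each participant becomes one flat record that also carries the fields of its parent complex, which is why the same complex ID appears on several rows.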

---
---

## Reformatting the data

The query results contain all the information about the complexes containing the Alpha-4 subunit. However, they are returned as a nested list that contains duplicates, which makes them difficult to parse without reformatting.

The following code simplifies the data and removes duplicates by grouping the data by the Complex ID and UniProt Accession:

In [None]:
# Reformat the data using groupby to remove repetition in the dataset
df_complexes_with_duplicates = pd.DataFrame(results)
df_complexes = df_complexes_with_duplicates.groupby(['complex_id', 'accession']).first().reset_index()
df_complexes
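As a minimal sketch of how this grouping removes duplicates, the toy table below repeats one complex/accession pair. The IDs are made up for illustration, and only the pandas calls already used above are involved.

```python
import pandas as pd

# Toy data: the first two rows are duplicates of the same complex/accession pair
toy = pd.DataFrame([
    {"complex_id": "COMPLEX-A", "accession": "ACC1", "stoichiometry": 2},
    {"complex_id": "COMPLEX-A", "accession": "ACC1", "stoichiometry": 2},
    {"complex_id": "COMPLEX-A", "accession": "ACC2", "stoichiometry": 3},
])

# One row per (complex_id, accession); first() keeps the first non-null value in each column
deduplicated = toy.groupby(["complex_id", "accession"]).first().reset_index()
print(deduplicated)
```

Note that `first()` silently discards any differing values in later rows of the same group, so this approach assumes the repeated rows carry the same information.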

We can also add new columns to make the data more human-readable, for example by adding protein and taxonomy names:

In [None]:
# Map UniProt IDs to Protein Names
accession_mapping = {
    'P43681': 'Neuronal acetylcholine receptor subunit alpha-4',
    'P17787': 'Neuronal acetylcholine receptor subunit beta-2',
    'P0ABE7': 'Soluble cytochrome b562',
}

# Map Tax IDs to Tax Names
taxonomy_mapping = {
    562: 'Escherichia coli',
    9606: 'Homo sapiens'
}

df_complexes['protein_name'] = df_complexes['accession'].map(accession_mapping)
df_complexes['taxonomy_name'] = df_complexes['taxonomy_id'].map(taxonomy_mapping)

# Reorder columns
new_column_order = ['complex_id', 'subcomplexes', 'accession', 'protein_name', 'stoichiometry', 'taxonomy_id', 'taxonomy_name']
df_complexes = df_complexes[new_column_order]


df_complexes
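Because the mappings are written by hand, any accession missing from the dictionary is mapped to `NaN`. The sketch below shows one way to spot such gaps and fall back to the accession itself. The accession `X00000` is a made-up placeholder, and falling back to the accession is only one possible choice.

```python
import pandas as pd

# "X00000" is a hypothetical accession that is absent from the mapping
toy = pd.DataFrame({"accession": ["P43681", "X00000"]})
names = {"P43681": "Neuronal acetylcholine receptor subunit alpha-4"}

# Series.map returns NaN for keys that are not in the dictionary
toy["protein_name"] = toy["accession"].map(names)

# List the accessions that still need a name
unmapped = toy[toy["protein_name"].isna()]["accession"].tolist()
print(unmapped)

# Fall back to the accession where no name is available
toy["protein_name"] = toy["protein_name"].fillna(toy["accession"])
print(toy)
```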

---
---

## Analysing the results

Once the data has been reformatted into a human-readable format, it is simple to obtain the relevant information.


In [None]:
# List all the complexes the subunit Alpha-4 is found in
df_complexes['complex_id'].unique().tolist()

We can also identify all the complexes which contain non-human proteins:

In [None]:
rows_with_non_human_proteins = df_complexes[df_complexes['taxonomy_name'] != 'Homo sapiens']
rows_with_non_human_proteins['complex_id'].unique().tolist()
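The filter above relies on the hand-written taxonomy name mapping. As a variation, the sketch below filters on the numeric `taxonomy_id` instead, which does not depend on every taxonomy ID having a name. The toy rows and complex IDs are made up for illustration; 9606 is the Homo sapiens taxonomy ID used in the mapping above.

```python
import pandas as pd

HUMAN_TAX_ID = 9606  # Homo sapiens

# Toy data with made-up complex IDs
toy = pd.DataFrame({
    "complex_id": ["COMPLEX-A", "COMPLEX-A", "COMPLEX-B"],
    "accession": ["P43681", "P0ABE7", "P43681"],
    "taxonomy_id": [9606, 562, 9606],
})

# Keep rows whose participant is not human, then list the complexes they belong to
non_human = toy[toy["taxonomy_id"] != HUMAN_TAX_ID]
non_human_complexes = non_human["complex_id"].unique().tolist()
print(non_human_complexes)
```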

### Optional extras

The code above finds all complexes containing UniProt accessions that are not from humans. Why might this miss some complexes that contain non-human proteins?

---
---

## Writing the results to file

We can save the results to a CSV file, which can then be opened in Excel.

In [None]:
df_complexes.to_csv("complexes.csv")
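By default, `to_csv` also writes the DataFrame index as an unnamed first column. If that column is not wanted in the spreadsheet, `index=False` leaves it out. The sketch below writes a toy table to an in-memory buffer so that the resulting text can be inspected; the complex ID is made up for illustration.

```python
import io

import pandas as pd

toy = pd.DataFrame({"complex_id": ["COMPLEX-A"], "accession": ["P43681"]})

# Write to an in-memory buffer instead of a file so we can look at the CSV text
buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False omits the row index column
csv_text = buffer.getvalue()
print(csv_text)
```

The same argument works when writing to a file path, for example `to_csv("complexes.csv", index=False)`.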