# PDBe API Training

### PDBe Similar proteins for a given protein

This tutorial will guide you through searching PDBe programmatically to find similar proteins for a given protein.

## Setup

First, we import the code required to query the API and reformat the results.

Run the cell below by pressing the play button.

In [None]:
import sys
sys.path.insert(0, '..')
from tutorial_utilities.api_modules import explode_dataset, get_url
import pandas as pd

---
---

## Obtaining data

Now we will find all the proteins with 40% or more sequence identity to human acetylcholinesterase (UniProt accession: P22303).

In [None]:
BASE_URL = "https://www.ebi.ac.uk/pdbe/"
PDBEKB_UNIPROT_URL = BASE_URL + "graph-api/uniprot/"


def get_similar_protein_data(accession, identity):
    """
    Get similar protein data for a given UniProt accession and identity threshold
    """

    url = f"{PDBEKB_UNIPROT_URL}similar_proteins/{accession}/{identity}"
    print(url)

    data = get_url(url=url)

    return data

In [None]:
uniprot_accession = 'P22303'
identity_cutoff = 40
results = get_similar_protein_data(uniprot_accession, identity_cutoff) 
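
The implementation of `get_url` lives in `tutorial_utilities` and is not shown in this notebook. As a rough stand-in, the sketch below builds the same URL and fetches it with the Python standard library. The helper names `build_similar_proteins_url` and `fetch_json` are hypothetical, and the error handling is an assumption rather than a description of what `get_url` actually does.

```python
import json
import urllib.error
import urllib.request

# Same endpoint prefix as PDBEKB_UNIPROT_URL in the cell above
UNIPROT_ENDPOINT = "https://www.ebi.ac.uk/pdbe/graph-api/uniprot/"


def build_similar_proteins_url(accession, identity):
    """Hypothetical helper: assemble the similar_proteins URL."""
    return f"{UNIPROT_ENDPOINT}similar_proteins/{accession}/{identity}"


def fetch_json(url, timeout=30):
    """Hypothetical stand-in for get_url: return parsed JSON, or None on an HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Report the failing status instead of raising, so a notebook run can continue
        print(f"Request failed with HTTP status {err.code}: {url}")
        return None


example_url = build_similar_proteins_url("P22303", 40)
print(example_url)
```

Calling `fetch_json(example_url)` would perform the actual request. It is left out of the sketch so that the block runs without network access.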

---
---

## Reformatting the data

The query results contain all the information about the similar proteins. However, they are returned as a complex nested list that is difficult to parse without reformatting.

The following code simplifies the data, flattening the nested format:

In [None]:
# Reformat data to make it a list of similar proteins for the protein of interest
df_expanded_uniprot = explode_dataset(
    result=results[uniprot_accession], 
    column_to_explode='mapped_segment'
)

# Obtain the flattened data for each similar protein
data = pd.json_normalize(df_expanded_uniprot['mapped_segment'])

# Create reformatted dataset by joining the flattened data with the exploded dataset 
df_similar_proteins = df_expanded_uniprot.join(data)
df_similar_proteins = df_similar_proteins.drop(columns='mapped_segment')
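
To see what the explode, normalize and join steps above do, it can help to run them on a tiny hand-made dataset. The sketch below uses only pandas and assumes that `explode_dataset` behaves roughly like `DataFrame.explode` followed by an index reset. The records and their field names (`entry_id`, `chains`) are invented for illustration and are not taken from the API response.

```python
import pandas as pd

# Invented records: each protein carries a nested list of segment dictionaries
toy_records = [
    {
        "uniprot_accession": "TOY001",
        "sequence_identity": 0.95,
        "mapped_segment": [
            {"entry_id": "1abc", "chains": ["A"]},
            {"entry_id": "2xyz", "chains": ["A", "B"]},
        ],
    },
    {
        "uniprot_accession": "TOY002",
        "sequence_identity": 0.42,
        "mapped_segment": [{"entry_id": "3def", "chains": ["C"]}],
    },
]

df_toy = pd.DataFrame(toy_records)

# One row per nested segment; reset the index so the later join aligns row by row
df_exploded = df_toy.explode("mapped_segment").reset_index(drop=True)

# Turn each segment dictionary into its own set of columns
flat = pd.json_normalize(df_exploded["mapped_segment"].tolist())

# Attach the new columns and drop the original nested column
df_flat = df_exploded.join(flat).drop(columns="mapped_segment")
print(df_flat)
```

The index reset matters here: after `explode`, rows that came from the same protein share an index value, and joining on a duplicated index would pair rows incorrectly.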

---
---

## Analysing the results

Once the data has been reformatted into a human-readable format, it is simple to obtain the relevant information.

**--This fulfils Project Aim 3A--**

In [None]:
df_similar_proteins.head()

We can also filter this dataframe of similar proteins on column values such as `sequence_identity`, `taxid` or `species`. For example, we can select all the similar proteins from human (`taxid == 9606`) as shown below.

**--The following filtering fulfils Project Aim 3B--**

In [None]:
df_human_similar = df_similar_proteins[df_similar_proteins['taxid'] == 9606] 
df_human_similar
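
Filters on several columns can be combined with boolean masks. The sketch below shows this on an invented dataframe that reuses the `taxid` and `sequence_identity` column names. The values, the 0 to 1 identity scale and the 0.9 threshold are assumptions for illustration, so check the scale used in `df_similar_proteins` before applying a cutoff to it.

```python
import pandas as pd

# Invented rows; real values come from df_similar_proteins
df_toy = pd.DataFrame(
    {
        "uniprot_accession": ["X1", "X2", "X3", "X4"],
        "taxid": [9606, 9606, 10090, 7787],
        "sequence_identity": [1.0, 0.55, 0.88, 0.60],
    }
)

# Cast taxid to int so the comparison also works if the column holds strings
is_human = df_toy["taxid"].astype(int) == 9606
is_close = df_toy["sequence_identity"] >= 0.9

# Combine conditions with & (each condition must be a boolean Series)
df_filtered = df_toy[is_human & is_close]

# Rank all similar proteins from most to least similar
df_sorted = df_toy.sort_values("sequence_identity", ascending=False)

print(df_filtered)
print(df_sorted)
```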

---
---

## Writing the results to file

We can save the results to CSV files, which can then be opened in Excel.

In [None]:
df_similar_proteins.to_csv("similar_proteins_project_aims_3a.csv")
df_human_similar.to_csv("similar_proteins_project_aims_3b.csv")
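
By default, `to_csv` also writes the dataframe index as an unnamed first column. If that column is not wanted in the spreadsheet, passing `index=False` leaves it out. The sketch below shows the round trip on an invented dataframe, writing to an in-memory buffer instead of a file so that nothing is created on disk.

```python
import io

import pandas as pd

# Invented rows standing in for df_similar_proteins
df_toy = pd.DataFrame(
    {"uniprot_accession": ["X1", "X2"], "taxid": [9606, 10090]}
)

# Write without the index column; a file path could be used in place of the buffer
buffer = io.StringIO()
df_toy.to_csv(buffer, index=False)

# Read the CSV text back to confirm that only the data columns were written
buffer.seek(0)
df_back = pd.read_csv(buffer)
print(df_back)
```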