# PDBe API Training

### PDBe Interactions

This tutorial will guide you through searching PDBe programmatically.


First we will import the code which will do the work.
Run the cell below by pressing the green play button.

In [6]:
import pandas as pd
print(pd.__version__)
import numpy as np
import requests
import matplotlib.pyplot as plt
from IPython.display import SVG, display
import sys
sys.path.insert(0,'..') # to ensure the below import works in all Jupyter notebooks
from python_modules.api_modules import run_sequence_search, explode_dataset, get_ligand_site_data


0.23.4


Now we are ready to run the sequence search we did in the last module.

We will search with an example sequence from UniProt, P08659
(Luciferin 4-monooxygenase).

In [4]:
sequence_to_search = """
MEDAKNIKKGPAPFYPLEDGTAGEQLHKAMKRYALVPGTIAFTDAHIEVNITYAEYFEMS
VRLAEAMKRYGLNTNHRIVVCSENSLQFFMPVLGALFIGVAVAPANDIYNERELLNSMNI
SQPTVVFVSKKGLQKILNVQKKLPIIQKIIIMDSKTDYQGFQSMYTFVTSHLPPGFNEYD
FVPESFDRDKTIALIMNSSGSTGLPKGVALPHRTACVRFSHARDPIFGNQIIPDTAILSV
VPFHHGFGMFTTLGYLICGFRVVLMYRFEEELFLRSLQDYKIQSALLVPTLFSFFAKSTL
IDKYDLSNLHEIASGGAPLSKEVGEAVAKRFHLPGIRQGYGLTETTSAILITPEGDDKPG
AVGKVVPFFEAKVVDLDTGKTLGVNQRGELCVRGPMIMSGYVNNPEATNALIDKDGWLHS
GDIAYWDEDEHFFIVDRLKSLIKYKGYQVAPAELESILLQHPNIFDAGVAGLPDDDAGEL
PAAVVVLEHGKTMTEKEIVDYVASQVTTAKKLRGGVVFVDEVPKGLTGKLDARKIREILI
KAKKGGKSKL
"""
filter_list = ['pfam_accession', 'pdb_id', 'molecule_name', 'ec_number', 'uniprot_accession_best', 'tax_id']
# You could add terms to this list if you later wish to sort the results by something else. Some of these terms are listed at https://www.ebi.ac.uk/pdbe/api/doc/search.html

search_results = run_sequence_search(sequence_to_search,
                                     filter_terms=filter_list,
                                     number_of_rows=1000
                                     )
# If you expect the search to return more results than number_of_rows, increase this number. However, the bigger the number, the slower the search, so consider testing the script with a smaller number and then running the final analysis on all the results.

print(search_results[0])
# Show the first result returned, to confirm that the search found something

Number of results 315
{'chain_id': ['A', 'B'], 'ec_number': ['1.13.12.7'], 'entity_id': 1, 'entry_entity': '5kyv_1', 'molecule_name': ['Luciferin 4-monooxygenase'], 'pdb_id': '5kyv', 'pfam_accession': ['PF00501', 'PF13193'], 'tax_id': [7054], 'uniprot_accession_best': ['P08659'], 'e_value': 0.0, 'percentage_identity': 99.5, 'result_sequence': None}


In the above script, we defined a specific sequence for UniProt accession P08659. If you want to repeat the analysis for any other sequence, you can change it in this box and run the search again.

Although you do not see it, the results of your search are now held in the notebook's memory in the variable 'search_results'. For each result returned by the search, this includes all the fields in the 'filter_list' above.
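
As a rough illustration of what 'search_results' holds, the sketch below builds a one-item list from the record printed above and reads a few fields from it. The name `example_results` is hypothetical and only stands in for the real search output; it assumes each result is a plain dictionary, as the printed record suggests.

```python
# Hypothetical stand-in for search_results: a list of dictionaries,
# copied from the first record printed above.
example_results = [
    {
        'chain_id': ['A', 'B'],
        'ec_number': ['1.13.12.7'],
        'entity_id': 1,
        'entry_entity': '5kyv_1',
        'molecule_name': ['Luciferin 4-monooxygenase'],
        'pdb_id': '5kyv',
        'pfam_accession': ['PF00501', 'PF13193'],
        'tax_id': [7054],
        'uniprot_accession_best': ['P08659'],
        'e_value': 0.0,
        'percentage_identity': 99.5,
        'result_sequence': None,
    }
]

first = example_results[0]

# Scalar fields can be read directly
summary = (first['pdb_id'], first['percentage_identity'])

# List-valued fields can hold one or more values per result
list_fields = sorted(key for key, value in first.items() if isinstance(value, list))

print(summary)
print(list_fields)
```

The list-valued fields are the reason the results need some reshaping before they fit neatly into a table.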

To sort and analyse this data more easily, we will load the 'search_results' output into a DataFrame (initially named 'df').

In [3]:
df = explode_dataset(search_results)
df = df.query('percentage_identity > 80')
# Keep only results with a percentage identity over 80%.
# You could reduce this number to broaden your search to more distantly related homologs.
group_by_uniprot = df.groupby('uniprot_accession_best').count().sort_values('pdb_id', ascending=False)
# Count the rows for each UniProt accession, so each accession appears once, most frequent first

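AttributeError: 'DataFrame' object has no attribute 'explode'

This error appears because `DataFrame.explode` was added in pandas 0.25.0, and the pandas version printed by the first cell is 0.23.4. Upgrade pandas and re-run the cell before continuing.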
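
The helper `explode_dataset` comes from this course's `python_modules` and its internals are not shown here. Judging by the error message, it relies on `DataFrame.explode`, which expands list-valued cells into one row per value. Here is a minimal sketch of that idea on made-up data (the second record is invented, and `xxxx` is a placeholder rather than a real PDB entry):

```python
import pandas as pd

# Toy records shaped like the search results; the second one is invented
records = [
    {'pdb_id': '5kyv', 'pfam_accession': ['PF00501', 'PF13193'], 'percentage_identity': 99.5},
    {'pdb_id': 'xxxx', 'pfam_accession': ['PF00501'], 'percentage_identity': 42.0},
]
toy = pd.DataFrame(records)

# One row per (pdb_id, pfam_accession) pair; needs pandas 0.25.0 or later
exploded = toy.explode('pfam_accession').reset_index(drop=True)

# Keep only close matches, using the same kind of query as the cell above
close_matches = exploded.query('percentage_identity > 80')
print(close_matches)
```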

How many UniProt accessions were there?

In [None]:
len(group_by_uniprot)

Let's look at the data to see what we have.

In [None]:
group_by_uniprot.head()
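
Note that `groupby(...).count()` returns one row per UniProt accession, with the number of matching rows in each remaining column, so the length of the grouped table is the number of distinct accessions. A small sketch on invented data (the accessions and PDB identifiers are placeholders):

```python
import pandas as pd

# Invented rows: two entries for one accession, one entry for another
toy = pd.DataFrame({
    'uniprot_accession_best': ['ACC_1', 'ACC_1', 'ACC_2'],
    'pdb_id': ['aaaa', 'bbbb', 'cccc'],
})

per_accession = (
    toy.groupby('uniprot_accession_best')
    .count()
    .sort_values('pdb_id', ascending=False)
)

# The number of rows equals the number of distinct accessions
n_accessions = len(per_accession)
print(per_accession)
print(n_accessions)
```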

Get the first UniProt accession from the results.

In [None]:
uniprot_accession = df['uniprot_accession_best'].iloc[0]

uniprot_accession

Get the compounds which interact with this UniProt accession.

In [None]:
ligand_data = get_ligand_site_data(uniprot_accession=uniprot_accession)
df2 = explode_dataset(ligand_data)

Some post-processing is required to separate interactingPDBEntries into separate columns.

In [None]:
data = pd.json_normalize(df2['interactingPDBEntries'])
df3 = df2.join(data).drop(columns='interactingPDBEntries')
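
To illustrate this step, the sketch below uses an invented two-row table in which `interactingPDBEntries` holds one dictionary per row. The keys inside the dictionaries and all the values are assumptions chosen for illustration; the real API response may carry different or additional fields. `pd.json_normalize` needs pandas 1.0 or later.

```python
import pandas as pd

# Invented data: each row carries a nested dictionary
toy = pd.DataFrame({
    'ligand_accession': ['LIG1', 'LIG2'],
    'startIndex': [245, 316],
    'interactingPDBEntries': [
        {'pdbId': 'aaaa', 'entityId': 1},
        {'pdbId': 'bbbb', 'entityId': 1},
    ],
})

# Turn each dictionary into its own set of columns...
nested = pd.json_normalize(toy['interactingPDBEntries'].tolist())

# ...then attach them by row position and drop the original nested column
flat = toy.join(nested).drop(columns='interactingPDBEntries')
print(flat.columns.tolist())
```

The join matches rows by index, so this pattern assumes the table has a default 0..n-1 index; if it does not, calling `reset_index(drop=True)` first is one way to make the rows line up.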


startIndex and endIndex are UniProt residue numbers, so we'll make a new column called residue_number
and copy startIndex there.
We are also going to count the number of results, so we'll make a dummy count column to count on.

In [None]:
df3['residue_number'] = df3['startIndex']
df3['count'] = df3['pdbId']

Now we are ready to use the data.

In [None]:
df3.head()

Ligands which interact with every entry have an interaction_ratio of 1.0.
So let's get them.

In [None]:
df4 = df3.groupby('ligand_accession')['interaction_ratio'].mean().reset_index()
ret = df4.query('interaction_ratio == 1.0')['ligand_accession'].values

In [None]:
ret
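
A small sketch of this filter on invented numbers. Because `interaction_ratio` is a floating-point value, the sketch compares with `numpy.isclose` rather than `==`; that is a defensive choice made for this example, not something the tutorial requires.

```python
import numpy as np
import pandas as pd

# Invented data: LIG1 is seen in every entry, LIG2 only in some
toy = pd.DataFrame({
    'ligand_accession': ['LIG1', 'LIG1', 'LIG2', 'LIG2'],
    'interaction_ratio': [1.0, 1.0, 1.0, 0.5],
})

# Mean interaction ratio per ligand
mean_ratio = toy.groupby('ligand_accession')['interaction_ratio'].mean().reset_index()

# Ligands whose mean ratio is (numerically) 1.0
is_one = np.isclose(mean_ratio['interaction_ratio'], 1.0)
always_present = mean_ratio[is_one]['ligand_accession'].tolist()
print(always_present)
```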

We have GOL (glycerol) and PEG in the list of ligands, so this criterion alone isn't enough to filter them out.

So, let's see if we can filter ligands by whether they interact with the residues which have the most interactions.

First let's see how many interactions we have per residue.

In [None]:
df4 = df3.groupby('residue_number')['count'].count().reset_index()

In [None]:
df4.plot.scatter(x='residue_number', y='count')
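
The dummy `count` column holds PDB identifiers, so `.count()` is simply counting the non-missing values for each residue. A sketch on invented data, shown alongside `groupby(...).size()`, which counts rows without needing a dummy column:

```python
import pandas as pd

# Invented data: residue 245 is seen in two entries, residue 316 in one
toy = pd.DataFrame({
    'residue_number': [245, 245, 316],
    'count': ['aaaa', 'bbbb', 'aaaa'],  # placeholder PDB identifiers
})

# Same pattern as the cell above: count non-missing values per residue
by_count = toy.groupby('residue_number')['count'].count().reset_index()

# Alternative: count rows per residue directly
by_size = toy.groupby('residue_number').size().reset_index(name='count')

print(by_count)
```

The two results differ only when the counted column contains missing values, which `.count()` skips and `.size()` includes.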

We can then determine the mean number of interactions.

In [None]:
mean = df4.mean()
mean

We need to extract the mean value as a number from the above result

In [None]:
mean_value = float(mean['count'])  # select by column name rather than by position
mean_value
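
As an alternative sketch, taking the mean of the `count` column directly returns a single number, with no need to pick a value out of a larger result. The table below is invented.

```python
import pandas as pd

# Invented interaction counts per residue
toy_counts = pd.DataFrame({
    'residue_number': [10, 11, 12, 13],
    'count': [1, 2, 3, 10],
})

# Mean of one column is a single number
mean_value = float(toy_counts['count'].mean())

# Residues with more interactions than the mean
above = toy_counts[toy_counts['count'] > mean_value]
print(mean_value)
print(above)
```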

Then we can plot residues which have more interactions than the mean in red
and those which are equal to or below in blue.

In [None]:
fig, ax = plt.subplots() # this makes one plot with an axis "ax" which we can add several plots to
df4.query('count <= {}'.format(mean_value)).plot.scatter(x='residue_number', y='count', color='blue', ax=ax)
df4.query('count > {}'.format(mean_value)).plot.scatter(x='residue_number', y='count', color='red', ax=ax)
ax.axhline(mean_value)
plt.show()
plt.close()

We can see that the residues in red cluster around residues 250-400.

The actual residues are:

In [None]:
all_data_over_mean = df4.query('count > {}'.format(mean_value))
all_data_over_mean


We only want the residue numbers for the next step.

In [None]:
residue_numbers_over_mean = all_data_over_mean['residue_number']
residue_numbers_over_mean

What ligands interact with these residues?

Now we want to get all ligand_accessions which interact with a residue in "residue_numbers_over_mean"

In [None]:
df5 = df3[df3['residue_number'].isin(residue_numbers_over_mean)]['ligand_accession']
df5

The same ligand appears several times, so we can apply "unique" to get our list of ligands
which interact with residues that have more interactions than the mean.

In [None]:
interesting_ligands = list(df5.unique())
interesting_ligands
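
The two steps above (select rows with `isin`, then remove duplicates with `unique`) can be sketched on invented data as follows. The residue numbers and ligand codes are placeholders.

```python
import pandas as pd

# Invented interactions: one row per residue/ligand contact
toy = pd.DataFrame({
    'residue_number': [245, 245, 316, 400],
    'ligand_accession': ['LIG1', 'LIG2', 'LIG1', 'LIG3'],
})

# Hypothetical residues with an above-average interaction count
busy_residues = pd.Series([245, 316])

# Keep only the ligands seen at those residues
hits = toy[toy['residue_number'].isin(busy_residues)]['ligand_accession']

# unique() keeps the order of first appearance
interesting = list(hits.unique())
print(interesting)
```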

It's worth seeing which ligands are not in our list

In [None]:
all_ligands = list(df3['ligand_accession'].unique())

missing_ligands = [x for x in all_ligands if x not in interesting_ligands]
missing_ligands
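
The list comprehension above scans `interesting_ligands` once for every ligand. For long lists, looking values up in a set gives the same answer with faster membership checks. A sketch with invented ligand codes:

```python
# Invented ligand codes
all_ligands = ['LIG1', 'LIG2', 'LIG3', 'LIG4']
interesting_ligands = ['LIG2', 'LIG4']

# Build the set once, then test membership against it
interesting_set = set(interesting_ligands)
missing_ligands = [x for x in all_ligands if x not in interesting_set]
print(missing_ligands)
```

Keeping the list comprehension, rather than subtracting two sets, preserves the original order of the ligands.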

Now we can display the interactions only for those ligands we have found

We will start with our DataFrame df3.

In [None]:
df3.head()

We will calculate the mean interaction_ratio for each residue and ligand pair in a DataFrame df6.

In [None]:
df6 = df3.groupby(['residue_number', 'ligand_accession'])['interaction_ratio'].mean().reset_index()

We are going to scale the interaction ratios, as we use them later as marker sizes.

In [None]:
df6['interaction_ratio'] = df6['interaction_ratio'] * 2
df6
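
Matplotlib interprets the `s` argument of a scatter plot as the marker area in points squared, so a ratio between 0 and 1 multiplied by 2 gives very small markers. If the points are hard to see, a larger factor may help. The sketch below uses a hypothetical factor of 200 on invented data; the factor is an assumption to tune by eye, not a value from this tutorial.

```python
import pandas as pd

# Invented mean interaction ratios for one ligand at two residues
toy = pd.DataFrame({
    'residue_number': [245, 316],
    'ligand_accession': ['LIG1', 'LIG1'],
    'interaction_ratio': [1.0, 0.25],
})

# Hypothetical scale factor: marker area in points squared for a ratio of 1.0
MARKER_SCALE = 200

# Store sizes in a separate column so the original ratios stay untouched
toy['marker_size'] = toy['interaction_ratio'] * MARKER_SCALE
print(toy)
```

Writing the sizes to a new column also means the cell can be re-run without scaling the ratios a second time.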

Now we can plot the ligand interactions of those ligands which interact with the most interacting residues.

We will put each ligand on a row and scale the markers by the proportion of PDB entries in which each interaction is seen.


In [None]:
# prepare a single figure with one axis "ax" to draw every ligand on
fig, ax = plt.subplots(figsize=(10, 20))

# plot the less interesting ligands in blue
for ligand in missing_ligands:
    # copy the selection so that adding a column does not raise SettingWithCopyWarning
    data = df6[df6['ligand_accession'] == ligand].copy()
    data['y_data'] = ligand
    data.plot.scatter(x='residue_number', y='y_data', ax=ax, s='interaction_ratio', c='blue')

# plot the interesting ligands in red
for ligand in interesting_ligands:
    data = df6[df6['ligand_accession'] == ligand].copy()
    data['y_data'] = ligand
    data.plot.scatter(x='residue_number', y='y_data', ax=ax, s='interaction_ratio', c='red')

plt.ylabel('Ligand')
plt.xlabel('UniProt Residue Number')
plt.title('Residues which interact with ligands,\nmarkers scaled by how often each interaction is seen in PDB entries')
plt.show()
plt.close()