# PDBe API Training

### PDBe Ligand Interactions for a given protein

This tutorial will guide you through searching PDBe for ligand interactions programmatically.


### Setup

First we will import the code which is required to search the API and plot the results.

Run the cell below by pressing the play button.

In [None]:
import pandas as pd
import numpy as np
import requests
from pprint import pprint
import matplotlib.pyplot as plt
from IPython.display import SVG, display
import sys
sys.path.insert(0,'..')
from tutorial_utilities.api_modules import (
    explode_dataset, 
    get_ligand_site_data, 
    get_similar_ligand_data, 
    get_ligand_role_data
)

### Obtaining data

Now we are ready to find all the ligands bound to a given protein, together with their interaction details.

We will get ligands for the Human Acetylcholinesterase, which has the UniProt accession P22303.

In [None]:
uniprot_accession = 'P22303'
ligand_data = get_ligand_site_data(uniprot_accession=uniprot_accession)

The above code gets a list of all ligands which interact with the protein identified by the UniProt accession. The function "get_ligand_site_data" also calculates the interaction ratio for each residue within its respective ligand binding site.

The interaction ratio for each residue within a ligand binding site is calculated by dividing the total number of PDB entries where the given residue interacts with the ligand by the total number of PDB entries that bind to that ligand. The interaction ratio represents the proportion of PDB entries that show an interaction between the residue and the specific ligand.
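The calculation described above can be sketched on toy data. This is a minimal illustration, not the implementation inside "get_ligand_site_data": the PDB IDs, the ligand name "LIG", the residue numbers and the helper `interaction_ratios` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical observations: (pdb_id, ligand, residue_number).
# Each tuple means "in this PDB entry, this residue contacts this ligand".
observations = [
    ("1aaa", "LIG", 100),
    ("1aaa", "LIG", 120),
    ("2bbb", "LIG", 100),
    ("3ccc", "LIG", 100),
    ("3ccc", "LIG", 130),
    ("4ddd", "LIG", 100),
]


def interaction_ratios(observations):
    """Return {(ligand, residue): ratio} for the given observations."""
    entries_per_ligand = defaultdict(set)  # ligand -> PDB entries containing it
    entries_per_pair = defaultdict(set)    # (ligand, residue) -> PDB entries with that contact
    for pdb_id, ligand, residue in observations:
        entries_per_ligand[ligand].add(pdb_id)
        entries_per_pair[(ligand, residue)].add(pdb_id)
    # ratio = entries where the residue contacts the ligand / entries binding the ligand
    return {
        pair: len(pdb_ids) / len(entries_per_ligand[pair[0]])
        for pair, pdb_ids in entries_per_pair.items()
    }


ratios = interaction_ratios(observations)
for (ligand, residue), ratio in sorted(ratios.items()):
    print(ligand, residue, ratio)
```

In this toy set, residue 100 contacts the ligand in all four entries (ratio 1.0), while residues 120 and 130 each appear in one entry out of four (ratio 0.25).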

The returned data contains a lot of information, so it needs reformatting to become useful.

In [None]:
# Print data - Warning there is a lot of data here!
#pprint(ligand_data)

### Reformatting the data

In [None]:
df2 = explode_dataset(result=ligand_data, column_to_explode='interactingPDBEntries')

Some post-processing is required to separate interactingPDBEntries into individual columns.

In [None]:
print(df2.head())

In [None]:
data = pd.json_normalize(df2['interactingPDBEntries'])
df3 = df2.join(data).drop(columns='interactingPDBEntries')
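The reshaping step above can be tried on a small hand-made table. This sketch assumes the data holds a list of dictionaries per row in interactingPDBEntries; the records and the "chain" field are hypothetical, and the real behaviour of "explode_dataset" may differ. Note that `join` aligns on the index, so the sketch resets the index after exploding to keep it unique.

```python
import pandas as pd

# Hypothetical records: one row per residue/ligand pair, each holding a list
# of dicts describing the PDB entries in which the interaction is seen.
records = [
    {"ligand_accession": "LIG1", "startIndex": 100,
     "interactingPDBEntries": [{"pdbId": "1aaa", "chain": "A"},
                               {"pdbId": "2bbb", "chain": "A"}]},
    {"ligand_accession": "LIG2", "startIndex": 150,
     "interactingPDBEntries": [{"pdbId": "1aaa", "chain": "B"}]},
]
toy = pd.DataFrame(records)

# One row per PDB entry; reset the index so that it is unique again.
exploded = toy.explode("interactingPDBEntries").reset_index(drop=True)

# Expand each dict into its own columns, then join on the unique index.
expanded = pd.json_normalize(exploded["interactingPDBEntries"].tolist())
flat = exploded.drop(columns="interactingPDBEntries").join(expanded)
print(flat)
```

If the index were left duplicated after exploding, the index-based join could pair rows with the wrong PDB entries, which is why the index is reset here.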


startIndex and endIndex are UniProt residue numbers, so we'll make a new column called residue_number
and copy startIndex there.
We are also going to count the number of results, so we'll make a dummy count column to store it in.

In [None]:
df3['residue_number'] = df3['startIndex']
df3['count'] = df3['pdbId']

### Exploratory analysis

Now that the data has been reformatted we are ready to perform some exploratory analysis.

In [None]:
df3.head()

A higher interaction ratio indicates that the residue is more likely to consistently interact with the ligand across multiple protein structures. It suggests that the residue plays a crucial role in the binding of the ligand within the binding site. On the other hand, a lower interaction ratio suggests that the residue's interaction with the ligand may be less consistent or may occur in a more context-dependent manner.

By calculating the interaction ratio for each residue within a ligand binding site, you can gain insights into the residues that are consistently involved in the binding of a specific ligand. This information can be valuable in understanding the key interactions between the ligand and the protein and potentially guide further studies or drug design efforts targeting that binding site.

Residues that interact with a ligand in every PDB entry containing that ligand have an interaction_ratio of 1.0. Let's get the ligands which have such residues.

In [None]:
ret = df3.query('interaction_ratio == 1.0')['ligand_accession'].unique()

In [None]:
ret

Let's see if we can filter the ligands down to those which interact with the residues that have the most interactions.

First, let's see how many interactions we have per residue.

In [None]:
df4 = df3.groupby('residue_number')['count'].count().reset_index()

In [None]:
df4.plot.scatter(x='residue_number', y='count')

### Obtaining summary statistics

We can also obtain summary statistics for the interactions. 

For example, the mean number of interactions and the standard deviation:

In [None]:
mean = df4.mean()
std = df4.std()
print("Mean:")
print(mean)
print("Standard deviation:")
print(std)

To make the numbers easier to access, we can extract them and save them to new variables:

In [None]:
# select the 'count' entry by name rather than by position
mean_value = float(mean['count'])
std_value = float(std['count'])
print(mean_value, std_value)

### Finding the residues that form the most interactions

Then we can plot residues which have more interactions than the mean in red
and those which are equal to or below in blue.

In [None]:
fig, ax = plt.subplots() # this makes one plot with an axis "ax" which we can add several plots to
df4.query('count <= {}'.format(mean_value)).plot.scatter(x='residue_number', y='count', color='blue', ax=ax)
df4.query('count > {}'.format(mean_value)).plot.scatter(x='residue_number', y='count', color='red', ax=ax)
ax.axhline(mean_value)
plt.show()
plt.close()

A higher threshold (twice the standard deviation) is more useful for selecting only the most common ligand-binding residues.

In [None]:
two_std_value = std_value * 2
fig, ax = plt.subplots() # this makes one plot with an axis "ax" which we can add several plots to
df4.query('count <= {}'.format(two_std_value)).plot.scatter(x='residue_number', y='count', color='blue', ax=ax)
df4.query('count > {}'.format(two_std_value)).plot.scatter(x='residue_number', y='count', color='red', ax=ax)
ax.axhline(two_std_value)
plt.show()
plt.close()

A list of the residues that form the most interactions (over 2 std) can be obtained:

In [None]:
all_data_over_two_std = df4.query('count > {}'.format(two_std_value)).sort_values(by='count', ascending=False)
all_data_over_two_std
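The thresholding used above can be reproduced on toy numbers without calling the API. This is a sketch only: the residue numbers and counts are hypothetical, and the standard library's `stdev` is used because, like the pandas default, it computes the sample standard deviation.

```python
from statistics import mean, stdev

# Hypothetical interaction counts per UniProt residue number.
counts = {72: 1, 74: 2, 86: 2, 124: 3, 202: 3,
          203: 4, 286: 4, 295: 5, 337: 12, 341: 24}

values = list(counts.values())
mean_value = mean(values)
std_value = stdev(values)      # sample standard deviation (n - 1 denominator)
two_std_value = 2 * std_value  # same cut-off as used in the cells above

# Residues above each threshold, in ascending residue order.
over_mean = sorted(r for r, c in counts.items() if c > mean_value)
over_two_std = sorted(r for r, c in counts.items() if c > two_std_value)

print("over mean:", over_mean)
print("over two std:", over_two_std)
```

With these numbers the mean is 6 and twice the standard deviation is roughly 14, so two residues exceed the mean but only one exceeds the stricter cut-off.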


### Finding the ligands that interact with these residues

We only want the residue numbers for the next step.

In [None]:
residue_numbers_over_two_std = all_data_over_two_std['residue_number']
residue_numbers_over_two_std

What ligands interact with these residues?

Now we want to get all ligand_accessions which interact with a residue in "residue_numbers_over_two_std"

In [None]:
df5  = df3[df3['residue_number'].isin(residue_numbers_over_two_std)]['ligand_accession']
df5

The same ligand appears several times, so we can "unique" the list to get our list of ligands
which interact with residues whose interaction count is above the two-standard-deviation threshold.

In [None]:
interesting_ligands = list(df5.unique())
interesting_ligands

PDBe-KB annotates ligands as drug-like, co-factor-like or reactant-like. These annotations are based on mappings to DrugBank, similarity to co-factor templates, or the ChEBI/Rhea databases. Let's see if any of the interesting_ligands we found have a functional role.

In [None]:
# get ligand annotations from PDBe-KB
data=get_ligand_role_data(uniprot_accession)
print(data[0])

In [None]:
# convert the data into a pandas DataFrame
df=pd.DataFrame(data)
# head() must be called; printing df.head shows the bound method instead of the rows
print(df.head())

In [None]:
# find all the ligands which have any functional role
df[df['acts_as']!='']

It's worth seeing which ligands are not in our list

In [None]:
all_ligands = list(df3['ligand_accession'].unique())

missing_ligands = [x for x in all_ligands if x not in interesting_ligands]
missing_ligands

Now we can display the interactions only for those ligands we have found

We will start with our Dataframe df3

In [None]:
df3.head()

We will calculate the mean interaction ratio for each residue and ligand pair and store it in a DataFrame df6.

In [None]:
df6 = df3.groupby(['residue_number', 'ligand_accession'])['interaction_ratio'].mean().reset_index()

We are going to scale the interaction ratios, as they are used as marker sizes in the plot later.

In [None]:
df6['interaction_ratio'] = df6['interaction_ratio'] * 2
df6

Now we can plot the ligand interactions of those ligands which interact with the most interacting residues.

We will put each ligand on a row and scale each marker by the proportion of PDB entries in which the interaction is seen.


In [None]:
# prepare a figure
plt.rcParams['figure.figsize'] = [12, 12]
fig, ax = plt.subplots()

# plot the less interesting ligands in blue
for ligand in missing_ligands:
    data = df6[df6['ligand_accession'] == ligand]
    data.plot.scatter(x='residue_number', y='ligand_accession', ax=ax, s='interaction_ratio', c='blue')

# plot the interesting ligands in red
for ligand in interesting_ligands:
    data = df6[df6['ligand_accession'] == ligand]
    data.plot.scatter(x='residue_number', y='ligand_accession', ax=ax, s='interaction_ratio', c='red')


plt.ylabel('Ligand')
plt.xlabel('UniProt Residue Number')
plt.title('Residues which interact with ligands,\nMarkers scaled by the number of times each interaction is seen in PDB entries')
plt.show()
plt.close()

### Comparing the chemical structures of ligands

It would be interesting to see if similar ligands bind to the same residues. Let's take a ligand from the interesting ligands above and find all the other ligands which are similar to it.

In the example below, we have taken the neurotoxin VX (HET code: VX), an acetylcholinesterase inhibitor. We find all the ligands similar to VX using the "get_similar_ligand_data" function. This function takes a ligand name and a similarity cutoff (0-1) as arguments. Here we use a similarity cutoff of 0.7 to find all the ligands which are 70% or more similar to VX.

In [None]:
ligand_exp = "VX"
similarity_cutoff = 0.7
#finding similar ligands to ligand_exp 
similar_ligands = get_similar_ligand_data(ligand_exp, similarity_cutoff)
sdf=pd.DataFrame(similar_ligands.items(),columns = ['similar_ligand','similarity_score'])
print(sdf)
#find common ligands from similar_ligands and interesting_ligands
common_ligands = [item for item in similar_ligands if item in interesting_ligands]
print(f"common ligands include - {common_ligands}")

Now you can compare whether VX and the common_ligands bind to the same residues, and check whether similar ligands tend to bind to similar sites.

In [None]:
# get the binding site for your ligand of interest
binding_site_1=sorted(df3[df3['ligand_accession'] == 'VX']['residue_number'].unique())
print(f"binding site for VX: {binding_site_1}")
# get binding site for common ligands 
common_binding_residues = []
for ligand in common_ligands :
    binding_site_2=sorted(df3[df3['ligand_accession'] == ligand]['residue_number'].unique())
    print(f"binding site for {ligand}: {binding_site_2}")
    common_binding_residues += [item for item in binding_site_2 if item in binding_site_1]
print(f"common binding site residues {common_binding_residues}")
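To put a number on how similar two binding sites are, one option is to compare them as sets. The sketch below is an illustration under assumptions: the helper `site_overlap`, the use of the Jaccard index as the score, and the residue numbers are all hypothetical and not part of the tutorial utilities. Using sets also avoids listing a shared residue more than once when several ligands are compared.

```python
def site_overlap(site_a, site_b):
    """Return (shared residues, Jaccard index) for two lists of residue numbers."""
    a, b = set(site_a), set(site_b)
    shared = sorted(a & b)  # residues present in both binding sites
    union = a | b           # residues present in either binding site
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# Hypothetical binding sites given as UniProt residue numbers.
site_a = [86, 124, 203, 297, 337, 447]
site_b = [86, 203, 297, 338, 447]

shared, score = site_overlap(site_a, site_b)
print(f"shared residues: {shared}")
print(f"Jaccard index: {score:.2f}")
```

A score of 1.0 would mean the two ligands contact exactly the same residues, and 0.0 would mean they share none.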