# PDBe API Training

### PDBe search

Searching with a sequence

In [1]:
from pprint import pprint # used for pretty printing
import sys
sys.path.insert(0,'..') # to ensure the below import works in all Jupyter notebooks
from python_modules.api_modules import run_sequence_search, pandas_dataset, pandas_count, pandas_plot, pandas_plot_multi_groupby

We will search using an example sequence: UniProt P24941, Cyclin-dependent kinase 2.

In [2]:
sequence_to_search = """
MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNH
PNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHS
HRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYY
STAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSF
PKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL"""

filter_list = ['pfam_accession', 'pdb_id', 'molecule_name', 'ec_number',
               'uniprot_accession_best', 'tax_id']

first_results = run_sequence_search(sequence_to_search, filter_terms=filter_list)

Number of results 10


Print the first result to see what we have

In [3]:
pprint(first_results[0])

{'e_value': 3.6e-76,
 'ec_number': ['2.7.11.22'],
 'entity_id': 1,
 'entry_entity': '3ezr_1',
 'molecule_name': ['Cyclin-dependent kinase 2'],
 'pdb_id': '3ezr',
 'percentage_identity': 100.0,
 'pfam_accession': ['PF00069'],
 'tax_id': [9606],
 'uniprot_accession_best': ['P24941']}


Before we do any further analysis, we should get more results so that we can see some patterns.
We are going to increase the number of results to 1000.

In [5]:
first_results = run_sequence_search(sequence_to_search,
                                    filter_terms=filter_list,
                                    number_of_rows=1000
                                    )


https://www.ebi.ac.uk/pdbe/search/pdb/select?json.nl=map&start=0&sort=fasta(e_value) asc&xjoin_fasta=true&bf=fasta(percentIdentity)&xjoin_fasta.external.expupperlim=0.1&xjoin_fasta.external.sequence=
MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNH
PNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHS
HRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYY
STAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSF
PKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL&q=*:*&fq={!xjoin}xjoin_fasta&fl=molecule_name,tax_id,entity_id,pdb_id,uniprot_accession_best,entry_entity,pfam_accession,ec_number&wt=json&rows=1000
Number of results 1000
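
The URL printed above shows the query parameters sent to the PDBe search service. As a rough sketch, the same query string could be assembled like this. The function name `build_sequence_search_params` is hypothetical and is not part of `python_modules`. The parameter names are copied from the printed URL. Stripping whitespace from the sequence and defaulting to 10 rows are assumptions, not confirmed behaviour of `run_sequence_search`. No request is sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ebi.ac.uk/pdbe/search/pdb/select"


def build_sequence_search_params(sequence, filter_terms, number_of_rows=10):
    """Hypothetical helper: rebuild the query parameters seen in the printed URL."""
    # entity_id and entry_entity appear in the printed URL in addition to the
    # filter terms we asked for, so we append them here as well.
    fields = list(filter_terms) + ["entity_id", "entry_entity"]
    return {
        "json.nl": "map",
        "start": 0,
        "sort": "fasta(e_value) asc",
        "xjoin_fasta": "true",
        "bf": "fasta(percentIdentity)",
        "xjoin_fasta.external.expupperlim": 0.1,
        # Assumption: remove line breaks and spaces from the pasted sequence.
        "xjoin_fasta.external.sequence": "".join(sequence.split()),
        "q": "*:*",
        "fq": "{!xjoin}xjoin_fasta",
        "fl": ",".join(fields),
        "wt": "json",
        "rows": number_of_rows,
    }


params = build_sequence_search_params(
    "MENFQKVEK\nIGEGTYGVV", ["pdb_id", "tax_id"], number_of_rows=1000
)
query_url = BASE_URL + "?" + urlencode(params)
print(query_url)
```

`urlencode` percent-encodes characters such as spaces and braces, so the printed string will look different from the unencoded URL shown in the notebook output.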


Load the results into a pandas DataFrame so we can query them.

In [6]:
df = pandas_dataset(first_results)
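
Each raw result stores several fields as lists (for example `'pfam_accession': ['PF00069']`), while the DataFrame shows plain strings such as `PF00134,PF02984`. The sketch below shows one way such a conversion could work. `results_to_dataframe` is a hypothetical name, and this is an assumption about what `pandas_dataset` does, not its actual implementation.

```python
import pandas as pd


def results_to_dataframe(results):
    """Hypothetical helper: one row per result, list values joined with commas."""
    rows = []
    for result in results:
        row = {}
        for key, value in result.items():
            if isinstance(value, list):
                # Join multi-valued fields into a single comma-separated string.
                row[key] = ",".join(str(item) for item in value)
            else:
                row[key] = value
        rows.append(row)
    # Keys missing from a result become NaN in that row.
    return pd.DataFrame(rows)


example_results = [
    {"pdb_id": "3ezr", "entity_id": 1, "ec_number": ["2.7.11.22"],
     "pfam_accession": ["PF00069"], "tax_id": [9606]},
    {"pdb_id": "3bht", "entity_id": 2,
     "pfam_accession": ["PF00134", "PF02984"], "tax_id": [9913]},
]
example_df = results_to_dataframe(example_results)
print(example_df)
```

The second example result has no `ec_number` key, so its `ec_number` cell is `NaN`, matching the missing value seen for Cyclin-A2.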

Let's see what we have. You'll see it looks a bit like a spreadsheet or a database table.

In [7]:
print(df.head())

   ec_number  entity_id entry_entity              molecule_name pdb_id  \
0  2.7.11.22          1       3ezr_1  Cyclin-dependent kinase 2   3ezr   
1  2.7.11.22          1       2r3p_1  Cyclin-dependent kinase 2   2r3p   
2  2.7.11.22          1       2vtr_1  Cyclin-dependent kinase 2   2vtr   
3        NaN          2       3bht_2                  Cyclin-A2   3bht   
4  2.7.11.22          1       3bht_1  Cyclin-dependent kinase 2   3bht   

    pfam_accession tax_id uniprot_accession_best       e_value  \
0          PF00069   9606                 P24941  2.900000e-76   
1          PF00069   9606                 P24941  2.900000e-76   
2          PF00069   9606                 P24941  2.900000e-76   
3  PF00134,PF02984   9913                 P30274  2.900000e-76   
4          PF00069   9606                 P24941  2.900000e-76   

   percentage_identity  
0                100.0  
1                100.0  
2                100.0  
3                100.0  
4                100.0  


We can save the results to a CSV file, which we can load into Excel.

In [8]:
df.to_csv("search_results.csv")
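
By default `to_csv` also writes the row index as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back. The sketch below, using a small toy DataFrame and in-memory buffers instead of files, shows the effect of `index=False`.

```python
import io

import pandas as pd

# Toy data for illustration only.
toy_df = pd.DataFrame(
    {"pdb_id": ["3ezr", "2r3p"], "percentage_identity": [100.0, 100.0]}
)

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Whether to keep the index is a matter of preference; dropping it gives a cleaner spreadsheet when the index is only a running row number.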

Our search did not apply a cut-off for E-value or percentage identity,
so we should check the range of values that came back.

We can select a column and find its minimum value with .min() or its maximum value with .max().

In [9]:
df['percentage_identity'].max()

100.0

In [10]:
df['percentage_identity'].min()

36.1

The same works for the E-value. Here we also want the minimum and maximum.


In [11]:
df['e_value'].min()

2.9e-76

In [12]:
df['e_value'].max()

5.5e-20
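
The four calls above can also be combined into one. A minimal sketch on toy data (the values are illustrative, not search results), using `agg` to compute the minimum and maximum of both columns at once:

```python
import pandas as pd

# Toy data for illustration only.
toy_df = pd.DataFrame(
    {
        "percentage_identity": [100.0, 54.2, 36.1],
        "e_value": [2.9e-76, 1.0e-40, 5.5e-20],
    }
)

# One row per statistic, one column per field.
summary = toy_df[["percentage_identity", "e_value"]].agg(["min", "max"])
print(summary)
```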

We can see that percentage identity drops as low as 36%.
Let's say we want to keep only results with more than 50% identity.

In [13]:
df2 = df.query('percentage_identity > 50')

We stored the filtered results in a new DataFrame called "df2".
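
Note that `>` excludes rows at exactly 50%. The sketch below, on toy data, compares `query` with the equivalent boolean mask and shows an inclusive threshold passed in from a Python variable. The variable name `min_identity` is ours.

```python
import pandas as pd

# Toy data for illustration only.
toy_df = pd.DataFrame(
    {
        "pdb_id": ["3ezr", "2r3p", "2vtr"],
        "percentage_identity": [100.0, 50.0, 36.1],
    }
)

# Strictly greater than 50, written as a query string and as a boolean mask.
strict = toy_df.query("percentage_identity > 50")
masked = toy_df[toy_df["percentage_identity"] > 50]

# Inclusive threshold; @ refers to a Python variable inside the query string.
min_identity = 50
inclusive = toy_df.query("percentage_identity >= @min_identity")

print(len(strict), len(masked), len(inclusive))
```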

In [14]:
df2.head()

|   | ec_number | entity_id | entry_entity | molecule_name | pdb_id | pfam_accession | tax_id | uniprot_accession_best | e_value | percentage_identity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.7.11.22 | 1 | 3ezr_1 | Cyclin-dependent kinase 2 | 3ezr | PF00069 | 9606 | P24941 | 2.9e-76 | 100.0 |
| 1 | 2.7.11.22 | 1 | 2r3p_1 | Cyclin-dependent kinase 2 | 2r3p | PF00069 | 9606 | P24941 | 2.9e-76 | 100.0 |
| 2 | 2.7.11.22 | 1 | 2vtr_1 | Cyclin-dependent kinase 2 | 2vtr | PF00069 | 9606 | P24941 | 2.9e-76 | 100.0 |
| 3 | NaN | 2 | 3bht_2 | Cyclin-A2 | 3bht | PF00134,PF02984 | 9913 | P30274 | 2.9e-76 | 100.0 |
| 4 | 2.7.11.22 | 1 | 3bht_1 | Cyclin-dependent kinase 2 | 3bht | PF00069 | 9606 | P24941 | 2.9e-76 | 100.0 |


Number of entries in the Dataframe

In [15]:
len(df2)

621

Max value of percentage identity

In [16]:
df2['percentage_identity'].max()

100.0

Min value of percentage identity

In [17]:
df2['percentage_identity'].min()

54.2

How many unique Pfam domains or UniProt accessions did we get back?

We can group the results by Pfam accession using "groupby" and then count the results.

In [18]:
df.groupby('pfam_accession').count()

| pfam_accession | ec_number | entity_id | entry_entity | molecule_name | pdb_id | tax_id | uniprot_accession_best | e_value | percentage_identity |
|---|---|---|---|---|---|---|---|---|---|
| PF00069 | 730 | 733 | 733 | 733 | 733 | 733 | 733 | 733 | 733 |
| PF00134 | 0 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
| PF00134,PF02984 | 0 | 126 | 126 | 126 | 126 | 126 | 126 | 126 | 126 |
| PF00134,PF09080 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| PF00134,PF09241 | 0 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| PF00134,PF16899 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| PF00183,PF02518 | 0 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| PF00352 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| PF00382,PF08271 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| PF00481 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
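
Two details of this output are worth a closer look. First, `count()` reports the number of non-missing values in each column, which is why `ec_number` can be lower than the other columns. Second, combinations such as `PF00134,PF02984` are treated as their own group. The sketch below, on toy rows, shows `size()` for plain row counts and one possible way to count per individual domain by splitting the string and using `explode`. This is an alternative we are suggesting, not part of the original notebook.

```python
import pandas as pd

# Toy rows for illustration only.
toy_df = pd.DataFrame(
    {
        "pdb_id": ["3ezr", "3bht", "3bht"],
        "pfam_accession": ["PF00069", "PF00134,PF02984", "PF00069"],
        "ec_number": ["2.7.11.22", None, "2.7.11.22"],
    }
)

# size() counts rows per group, including rows with missing values.
rows_per_group = toy_df.groupby("pfam_accession").size()

# count() counts non-missing values per column within each group.
non_missing = toy_df.groupby("pfam_accession").count()

# Split combined accessions into lists, then give each domain its own row.
per_domain = toy_df.assign(
    pfam_accession=toy_df["pfam_accession"].str.split(",")
).explode("pfam_accession")

# Number of distinct PDB entries in which each individual domain appears.
entries_per_domain = per_domain.groupby("pfam_accession")["pdb_id"].nunique()

print(rows_per_group)
print(non_missing)
print(entries_per_domain)
```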


We can do the same for UniProt accessions.
This time we will sort the values by the number of PDB entries ("pdb_id" values) they appear in.

In [22]:
group_by_uniprot = df.groupby('uniprot_accession_best').count().sort_values('pdb_id', ascending=False)
group_by_uniprot

| uniprot_accession_best | ec_number | entity_id | entry_entity | molecule_name | pdb_id | pfam_accession | tax_id | e_value | percentage_identity |
|---|---|---|---|---|---|---|---|---|---|
| P24941 | 413 | 413 | 413 | 413 | 413 | 413 | 413 | 413 | 413 |
| P28482 | 109 | 109 | 109 | 109 | 109 | 109 | 109 | 109 | 109 |
| P20248 | 0 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 95 |
| P63086 | 59 | 59 | 59 | 59 | 59 | 59 | 59 | 59 | 59 |
| P47811 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| P36954 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| P36507 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| P35269 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| P35236 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |


In this case the most common UniProt accession is P24941.
How many UniProt accessions were there?

In [20]:
len(group_by_uniprot)

109
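
The grouped count above counts result rows. If one entry contributed several matching entities for the same accession, each would be counted separately. As a hedged alternative, `nunique` counts distinct values directly. The toy data below is illustrative only and deliberately repeats one `pdb_id` to show the difference.

```python
import pandas as pd

# Toy data for illustration only; the repeated "3bht" row is deliberate.
toy_df = pd.DataFrame(
    {
        "pdb_id": ["3ezr", "3bht", "3bht", "3bht"],
        "uniprot_accession_best": ["P24941", "P24941", "P24941", "P30274"],
    }
)

# Number of distinct UniProt accessions, without building a grouped table.
n_accessions = toy_df["uniprot_accession_best"].nunique()

# Rows per accession versus distinct PDB entries per accession.
rows_per_accession = toy_df.groupby("uniprot_accession_best")["pdb_id"].count()
entries_per_accession = (
    toy_df.groupby("uniprot_accession_best")["pdb_id"]
    .nunique()
    .sort_values(ascending=False)
)

print(n_accessions)
print(rows_per_accession)
print(entries_per_accession)
```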

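How many are enzymes? We can use "ec_number" to see how many have E.C. numbers. Because count() only counts non-missing values, an accession with no E.C. number has an "ec_number" count of 0.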

In [25]:
uniprot_with_ec = group_by_uniprot.query('ec_number != 0')

In [26]:
len(uniprot_with_ec)

40
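
The same question can be asked of the ungrouped data, which avoids relying on the grouped counts. A minimal sketch on toy rows, assuming missing E.C. numbers are stored as missing values (`NaN`/`None`):

```python
import pandas as pd

# Toy rows for illustration only.
toy_df = pd.DataFrame(
    {
        "uniprot_accession_best": ["P24941", "P24941", "P30274"],
        "ec_number": ["2.7.11.22", "2.7.11.22", None],
    }
)

# Keep rows that have an E.C. number, then count distinct accessions.
accessions_with_ec = (
    toy_df.dropna(subset=["ec_number"])["uniprot_accession_best"].nunique()
)
print(accessions_with_ec)
```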