# PDBe API Training

### PDBe search

Searching with a sequence

In [1]:
from pprint import pprint # used for pretty printing
import sys
sys.path.insert(0,'..') # to ensure the below import works in all Jupyter notebooks
from python_modules.api_modules import run_sequence_search, pandas_dataset, pandas_count, pandas_plot, pandas_plot_multi_groupby

We will search using an example sequence: UniProt P24941,
Cyclin-dependent kinase 2.

In [2]:
sequence_to_search = """
MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNH
PNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHS
HRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYY
STAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSF
PKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL"""

filter_list = ['pfam_accession', 'pdb_id', 'molecule_name', 'ec_number',
               'uniprot_accession_best', 'tax_id']

first_results = run_sequence_search(sequence_to_search, filter_terms=filter_list)

https://www.ebi.ac.uk/pdbe/search/pdb/select?group=true&group.field=pdb_id&group.ngroups=true&json.nl=map&start=0&sort=fasta(e_value) asc&xjoin_fasta=true&bf=fasta(percentIdentity)&xjoin_fasta.external.expupperlim=0.1&xjoin_fasta.external.sequence=
MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNH
PNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHS
HRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYY
STAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSF
PKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL&q=*:*&fq={!xjoin}xjoin_fasta&fl=pfam_accession,pdb_id,molecule_name,ec_number,uniprot_accession_best,tax_id&wt=json&rows=10
Number of results 10
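
The URL printed above shows which query parameters the search sends. As a rough illustration of how such a URL could be assembled, here is a minimal sketch using only the standard library. The function name `build_sequence_search_url` is hypothetical and is not the implementation of `run_sequence_search`. The parameter names and values are copied from the printed URL. Stripping whitespace from the sequence and percent-encoding the values are assumptions of this sketch.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ebi.ac.uk/pdbe/search/pdb/select"


def build_sequence_search_url(sequence, filter_terms, number_of_rows=10):
    """Assemble a search URL using the parameters seen in the printed URL.

    Hypothetical helper for illustration only.
    """
    params = {
        "group": "true",
        "group.field": "pdb_id",
        "group.ngroups": "true",
        "json.nl": "map",
        "start": 0,
        "sort": "fasta(e_value) asc",
        "xjoin_fasta": "true",
        "bf": "fasta(percentIdentity)",
        "xjoin_fasta.external.expupperlim": 0.1,
        # Assumption: remove line breaks and spaces from the sequence
        "xjoin_fasta.external.sequence": "".join(sequence.split()),
        "q": "*:*",
        "fq": "{!xjoin}xjoin_fasta",
        # Fields to return, comma separated
        "fl": ",".join(filter_terms),
        "wt": "json",
        "rows": number_of_rows,
    }
    # urlencode percent-encodes special characters such as '(' and ','
    return BASE_URL + "?" + urlencode(params)


example_url = build_sequence_search_url(
    "MENFQKVEKIGEGTYGVVYK\nARNKLTGEVV",
    ["pdb_id", "tax_id"],
    number_of_rows=5,
)
print(example_url)
```

The `fl` parameter carries the filter terms and `rows` carries the number of results requested, which is why changing `number_of_rows` changes the end of the URL.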


Print the first result to see what we have

In [3]:
pprint(first_results[0])

{'e_value': 2.4e-76,
 'ec_number': ['2.7.11.22'],
 'molecule_name': ['Cyclin-dependent kinase 2'],
 'pdb_id': '3ezr',
 'percentage_identity': 100.0,
 'pfam_accession': ['PF00069'],
 'tax_id': [9606],
 'uniprot_accession_best': ['P24941']}


Before we do any further analysis, we should get more results so that we can see some patterns.
We are going to increase the number of results to 1000.

In [9]:
first_results = run_sequence_search(sequence_to_search,
                                    filter_terms=filter_list,
                                    number_of_rows=1000
                                    )


https://www.ebi.ac.uk/pdbe/search/pdb/select?group=true&group.field=pdb_id&group.ngroups=true&json.nl=map&start=0&sort=fasta(e_value) asc&xjoin_fasta=true&bf=fasta(percentIdentity)&xjoin_fasta.external.expupperlim=0.1&xjoin_fasta.external.sequence=
MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNH
PNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHS
HRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYY
STAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSF
PKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL&q=*:*&fq={!xjoin}xjoin_fasta&fl=pfam_accession,pdb_id,molecule_name,ec_number,uniprot_accession_best,tax_id&wt=json&rows=1000
Number of results 819


Load the results into a pandas DataFrame so we can query them.

In [16]:
df = pandas_dataset(first_results)
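
The search returns most fields as lists (for example `'pfam_accession': ['PF00069']`), while the table printed from the DataFrame shows plain values such as `PF00134,PF02984`. The code of `pandas_dataset` is not shown here, so the following is only a sketch of one way that flattening could be done, assuming list values are joined with commas. `flatten_record` is a hypothetical helper, and the example values are taken from the 1ogu row of the output.

```python
def flatten_record(record):
    """Return a copy of one search hit with list values joined into strings.

    Hypothetical helper; not the implementation of pandas_dataset.
    """
    flat = {}
    for key, value in record.items():
        if isinstance(value, list):
            # Join multiple values into one comma-separated string
            flat[key] = ",".join(str(item) for item in value)
        else:
            flat[key] = value
    return flat


example_hit = {
    "pdb_id": "1ogu",
    "pfam_accession": ["PF00134", "PF02984"],
    "tax_id": [9606],
    "percentage_identity": 100.0,
}
flat_hit = flatten_record(example_hit)
print(flat_hit)
```

A list of such flattened dictionaries can be passed directly to `pandas.DataFrame(...)` to get one row per hit.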

Let's see what we have. You'll see it looks a bit like a spreadsheet or a database table.

In [17]:
print(df.head())

   ec_number              molecule_name pdb_id   pfam_accession tax_id  \
0  2.7.11.22  Cyclin-dependent kinase 2   3ezr          PF00069   9606   
1  2.7.11.22  Cyclin-dependent kinase 2   5osj          PF00069   9606   
2  2.7.11.22  Cyclin-dependent kinase 2   2r3p          PF00069   9606   
3  2.7.11.22  Cyclin-dependent kinase 2   2vtr          PF00069   9606   
4        NaN                  Cyclin-A2   1ogu  PF00134,PF02984   9606   

  uniprot_accession_best       e_value  percentage_identity  
0                 P24941  1.400000e-75                100.0  
1                 P24941  1.400000e-75                100.0  
2                 P24941  1.400000e-75                100.0  
3                 P24941  1.400000e-75                100.0  
4                 P20248  1.400000e-75                100.0  


We did not specify a cut-off for E-value or percentage identity in our search,
so we should check what range the values cover.

We can select a column and find its largest and smallest values with .max() and .min()

In [20]:
print(df['percentage_identity'].max())
print(df['percentage_identity'].min())

100.0
36.1


The same goes for the E-value. Here we want both the minimum and the maximum.


In [19]:
print(df['e_value'].min())
print(df['e_value'].max())

1.4e-75
1.1e-19


We can see that percentage identity drops as low as 36%.
Let's say we want to keep only the hits with more than 50% identity.

In [21]:
df2 = df[df['percentage_identity'] > 50]

We stored the filtered results in a new DataFrame called "df2".

In [28]:
print(df2.head())
print('Number of entries in the Dataframe: {}'.format(len(df2)))
print('Max value of percentage identity: {}'.format(df2['percentage_identity'].max()))
print('Min value of percentage identity: {}'.format(df2['percentage_identity'].min()))

   ec_number              molecule_name pdb_id   pfam_accession tax_id  \
0  2.7.11.22  Cyclin-dependent kinase 2   3ezr          PF00069   9606   
1  2.7.11.22  Cyclin-dependent kinase 2   5osj          PF00069   9606   
2  2.7.11.22  Cyclin-dependent kinase 2   2r3p          PF00069   9606   
3  2.7.11.22  Cyclin-dependent kinase 2   2vtr          PF00069   9606   
4        NaN                  Cyclin-A2   1ogu  PF00134,PF02984   9606   

  uniprot_accession_best       e_value  percentage_identity  
0                 P24941  1.400000e-75                100.0  
1                 P24941  1.400000e-75                100.0  
2                 P24941  1.400000e-75                100.0  
3                 P24941  1.400000e-75                100.0  
4                 P20248  1.400000e-75                100.0  
Number of entries in the Dataframe: 441
Max value of percentage identity: 100.0
Min value of percentage identity: 54.2
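
The same boolean-mask pattern extends to more than one condition. The sketch below combines a percentage identity filter with an E-value filter using `&`. The toy DataFrame, its PDB-style identifiers, and the `1e-20` threshold are invented for illustration and are not results from the search above. Only the column names match the real data.

```python
import pandas as pd

# Toy data for illustration only; not real search results
toy = pd.DataFrame({
    "pdb_id": ["aaaa", "bbbb", "cccc", "dddd"],
    "percentage_identity": [100.0, 54.2, 36.1, 72.0],
    "e_value": [1.4e-75, 3.0e-40, 1.1e-19, 5.0e-10],
})

# Each comparison gives a boolean Series; wrap each in parentheses
# and combine them element-wise with & (and) or | (or)
mask = (toy["percentage_identity"] > 50) & (toy["e_value"] < 1e-20)
strict = toy[mask]
print(strict)
```

The parentheses around each comparison matter, because `&` binds more tightly than `>` and `<` in Python.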
