# PDBe API Training

### PDBe search

Searching with a sequence

In [1]:
from pprint import pprint # used for pretty printing
import sys
sys.path.insert(0,'..') # to ensure the below import works in all Jupyter notebooks
from python_modules.api_modules import run_sequence_search, explode_dataset

We will search using an example sequence: luciferase from Photinus pyralis (common eastern firefly).

In [2]:
sequence_to_search = """
MEDAKNIKKGPAPFYPLEDGTAGEQLHKAMKRYALVPGTIAFTDAHIEVNITYAEYFEMS
VRLAEAMKRYGLNTNHRIVVCSENSLQFFMPVLGALFIGVAVAPANDIYNERELLNSMNI
SQPTVVFVSKKGLQKILNVQKKLPIIQKIIIMDSKTDYQGFQSMYTFVTSHLPPGFNEYD
FVPESFDRDKTIALIMNSSGSTGLPKGVALPHRTACVRFSHARDPIFGNQIIPDTAILSV
VPFHHGFGMFTTLGYLICGFRVVLMYRFEEELFLRSLQDYKIQSALLVPTLFSFFAKSTL
IDKYDLSNLHEIASGGAPLSKEVGEAVAKRFHLPGIRQGYGLTETTSAILITPEGDDKPG
AVGKVVPFFEAKVVDLDTGKTLGVNQRGELCVRGPMIMSGYVNNPEATNALIDKDGWLHS
GDIAYWDEDEHFFIVDRLKSLIKYKGYQVAPAELESILLQHPNIFDAGVAGLPDDDAGEL
PAAVVVLEHGKTMTEKEIVDYVASQVTTAKKLRGGVVFVDEVPKGLTGKLDARKIREILI
KAKKGGKSKL
"""
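The triple-quoted string above contains line breaks. Whether `run_sequence_search` strips them itself is not shown here, so the sketch below assumes it may not and normalises the sequence first. `clean_sequence` and `VALID_RESIDUES` are hypothetical names for illustration; they are not part of `python_modules.api_modules`.

```python
# Hypothetical helper for illustration; not part of python_modules.api_modules.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def clean_sequence(raw_sequence):
    """Join a multi-line sequence into one upper-case string and validate it."""
    # str.split() with no arguments splits on any whitespace, including newlines.
    sequence = "".join(raw_sequence.split()).upper()
    unexpected = sorted(set(sequence) - VALID_RESIDUES)
    if unexpected:
        raise ValueError(f"Unexpected characters in sequence: {unexpected}")
    return sequence


# A short fragment of the luciferase sequence, split over two lines.
fragment = """
MEDAKNIKKG
PAPFYPLEDG
"""
cleaned_fragment = clean_sequence(fragment)
print(cleaned_fragment, len(cleaned_fragment))
```

The same call could be applied to `sequence_to_search` before passing it to the search function.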

In [3]:
filter_list = ['pfam_accession', 'pdb_id', 'molecule_name', 'ec_number',
               'uniprot_accession_best', 'tax_id']

first_results = run_sequence_search(sequence_to_search, filter_terms=filter_list)

Number of results 10


Print the first result to see what we have

In [4]:
pprint(first_results[0])

{'chain_id': 'A',
 'e_value': 0.0,
 'ec_number': ['1.13.12.7'],
 'entity_id': 1,
 'entry_entity': '3ies_1',
 'molecule_name': ['Luciferin 4-monooxygenase'],
 'pdb_id': '3ies',
 'percentage_identity': 100.0,
 'pfam_accession': ['PF00501', 'PF13193'],
 'result_sequence': None,
 'tax_id': [7054],
 'uniprot_accession_best': ['P08659']}


The first search returned only 10 results. Before doing any further analysis we should get more results so we can see some patterns.
We will raise the row limit to 1000 so that every hit is returned.

In [5]:
first_results = run_sequence_search(sequence_to_search,
                                    filter_terms=filter_list,
                                    number_of_rows=1000
                                    )


Number of results 222


Load the results into a pandas DataFrame so we can query them

In [7]:
df = explode_dataset(first_results)


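Let's see what we have. It looks a bit like a spreadsheet or a database table.

Note that the same PDB code repeats many times because we exploded the results: each value in a list-valued field (for example, each of the two Pfam accessions) gets its own row.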

In [9]:
df.head()

Unnamed: 0,chain_id,ec_number,entity_id,entry_entity,molecule_name,pdb_id,pfam_accession,tax_id,uniprot_accession_best,e_value,percentage_identity,result_sequence
0,A,1.13.12.7,1,3ies_1,Luciferin 4-monooxygenase,3ies,PF00501,7054,P08659,0.0,100.0,
1,A,1.13.12.7,1,3ies_1,Luciferin 4-monooxygenase,3ies,PF13193,7054,P08659,0.0,100.0,
2,A,1.13.12.7,1,5kyt_1,Luciferin 4-monooxygenase,5kyt,PF00501,7054,P08659,0.0,99.8,
3,A,1.13.12.7,1,5kyt_1,Luciferin 4-monooxygenase,5kyt,PF13193,7054,P08659,0.0,99.8,
4,B,1.13.12.7,1,5kyt_1,Luciferin 4-monooxygenase,5kyt,PF00501,7054,P08659,0.0,99.8,
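The implementation of `explode_dataset` is not shown in this notebook. As a rough sketch of the idea, and assuming it behaves similarly to pandas' `DataFrame.explode`, the example below turns one record with list-valued fields into one row per list element. The record is a cut-down copy of the first search result.

```python
import pandas as pd

# One record shaped like the first search result; list-valued fields hold
# several annotations for the same chain.
records = [
    {
        "pdb_id": "3ies",
        "chain_id": "A",
        "pfam_accession": ["PF00501", "PF13193"],
        "ec_number": ["1.13.12.7"],
        "percentage_identity": 100.0,
    }
]

exploded = pd.DataFrame(records)
for column in ["pfam_accession", "ec_number"]:
    # Each list element becomes its own row; the other columns are copied.
    exploded = exploded.explode(column)
exploded = exploded.reset_index(drop=True)

print(exploded[["pdb_id", "chain_id", "pfam_accession", "ec_number"]])
```

Because 3ies chain A carries two Pfam accessions, it ends up as two rows, which matches the repetition seen in the table.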


We can save the results to a CSV file, which we can then load into Excel

In [8]:
df.to_csv("search_results.csv")
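By default `to_csv` also writes the row index as an unnamed first column, which pandas labels `Unnamed: 0` if the file is read back. The sketch below uses a small stand-in DataFrame and an in-memory buffer (so no file is written) to show the difference that `index=False` makes.

```python
import io

import pandas as pd

# Small stand-in for the search results.
small = pd.DataFrame(
    {"pdb_id": ["3ies", "5kyt"], "percentage_identity": [100.0, 99.8]}
)

# Default: the row index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(small.to_csv()))

# index=False: only the data columns are written.
without_index = pd.read_csv(io.StringIO(small.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

For a file meant for Excel, `df.to_csv("search_results.csv", index=False)` avoids the extra column.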

Our search did not apply a cut-off on E-value or percentage identity,
so we should check the range of values that came back.

We can select a column and find its minimum value with .min() or its maximum value with .max()

In [10]:
df['percentage_identity'].max()

100.0

In [11]:
df['percentage_identity'].min()

20.0

The same for E-value: here too we want the minimum and the maximum.


In [11]:
df['e_value'].min()

2.9e-76

In [12]:
df['e_value'].max()

5.5e-20
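The four separate calls can also be combined. The sketch below uses `DataFrame.agg` to get the minimum and maximum of both columns at once. The values in `toy` are illustrative only and are not real search results.

```python
import pandas as pd

# Illustrative values only; not real search results.
toy = pd.DataFrame(
    {
        "percentage_identity": [100.0, 99.8, 84.2, 20.0],
        "e_value": [1e-200, 1e-180, 2.9e-76, 5.5e-20],
    }
)

# One row per statistic, one column per field.
summary = toy[["percentage_identity", "e_value"]].agg(["min", "max"])
print(summary)
```

On the real data the equivalent call would be `df[['percentage_identity', 'e_value']].agg(['min', 'max'])`.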

We can see that percentage identity drops as low as 20%.
Let's say we want to keep only hits with more than 80% identity.

In [12]:
df2 = df.query('percentage_identity > 80')

We stored the results in a new DataFrame called "df2"
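Note that `> 80` is a strict comparison, so a hit at exactly 80% would be dropped. The sketch below, on made-up rows with placeholder identifiers, compares the strict filter with an inclusive boolean mask, and shows how an E-value threshold could be added using `@` to reference Python variables inside `query`. The thresholds are arbitrary examples.

```python
import pandas as pd

# Made-up rows with placeholder identifiers; not real search results.
toy = pd.DataFrame(
    {
        "pdb_id": ["hit1", "hit2", "hit3", "hit4"],
        "percentage_identity": [100.0, 80.0, 84.2, 20.0],
        "e_value": [1e-200, 1e-150, 2.9e-76, 5.5e-20],
    }
)

# Strict comparison, as used in the notebook: the 80.0 row is excluded.
strict = toy.query("percentage_identity > 80")

# Equivalent boolean-mask syntax, here with an inclusive comparison.
inclusive = toy[toy["percentage_identity"] >= 80]

# Combining two conditions; @name refers to a Python variable.
min_identity = 80
max_e_value = 1e-100
combined = toy.query(
    "percentage_identity >= @min_identity and e_value <= @max_e_value"
)

print(len(strict), len(inclusive), len(combined))
```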

In [13]:
df2.head()

Unnamed: 0,chain_id,ec_number,entity_id,entry_entity,molecule_name,pdb_id,pfam_accession,tax_id,uniprot_accession_best,e_value,percentage_identity,result_sequence
0,A,1.13.12.7,1,3ies_1,Luciferin 4-monooxygenase,3ies,PF00501,7054,P08659,0.0,100.0,
1,A,1.13.12.7,1,3ies_1,Luciferin 4-monooxygenase,3ies,PF13193,7054,P08659,0.0,100.0,
2,A,1.13.12.7,1,5kyt_1,Luciferin 4-monooxygenase,5kyt,PF00501,7054,P08659,0.0,99.8,
3,A,1.13.12.7,1,5kyt_1,Luciferin 4-monooxygenase,5kyt,PF13193,7054,P08659,0.0,99.8,
4,B,1.13.12.7,1,5kyt_1,Luciferin 4-monooxygenase,5kyt,PF00501,7054,P08659,0.0,99.8,


Number of rows in the DataFrame

In [15]:
len(df2)

621

Max value of percentage identity

In [12]:
df2['percentage_identity'].max()

100.0

Min value of percentage identity

In [13]:
df2['percentage_identity'].min()

84.2

How many unique Pfam domains or UniProt accessions did we get back?

We can group the results by Pfam accession using "groupby" and then count the rows in each group

In [21]:
df2.groupby('pfam_accession').count().sort_values('pdb_id', ascending=False)

pfam_accession,chain_id,ec_number,entity_id,entry_entity,molecule_name,pdb_id,tax_id,uniprot_accession_best,e_value,percentage_identity,result_sequence
PF00501,25,25,25,25,25,25,25,25,25,25,0
PF13193,25,25,25,25,25,25,25,25,25,25,0
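Because the DataFrame has one row per chain and per exploded value, `count()` counts rows rather than distinct PDB entries. If the number of distinct entries per Pfam domain is what matters, `nunique()` is one option. The sketch below uses a few rows shaped like the table from `df.head()`.

```python
import pandas as pd

# Rows shaped like the exploded results: one row per chain and Pfam domain.
toy = pd.DataFrame(
    {
        "pdb_id": ["3ies", "3ies", "5kyt", "5kyt", "5kyt", "5kyt"],
        "chain_id": ["A", "A", "A", "A", "B", "B"],
        "pfam_accession": ["PF00501", "PF13193"] * 3,
    }
)

# count(): number of rows in each group (chains are counted separately).
row_counts = toy.groupby("pfam_accession")["pdb_id"].count()

# nunique(): number of distinct PDB entries in each group.
entry_counts = toy.groupby("pfam_accession")["pdb_id"].nunique()

print(row_counts)
print(entry_counts)
```

On the real data this would be `df2.groupby('pfam_accession')['pdb_id'].nunique()`.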


The same for UniProt accession.
Again we sort by the "pdb_id" count, that is, the number of rows in which each accession appears.

In [19]:
group_by_uniprot = df2.groupby('uniprot_accession_best').count().sort_values('pdb_id', ascending=False)

Then let's have a look at what we have

In [22]:
group_by_uniprot

uniprot_accession_best,chain_id,ec_number,entity_id,entry_entity,molecule_name,pdb_id,pfam_accession,tax_id,e_value,percentage_identity,result_sequence
P08659,48,48,48,48,48,48,46,48,48,48,0
Q5UFR2,4,4,4,4,4,4,4,4,4,4,0


In this case the most common UniProt accession is P08659.
How many UniProt accessions were there?

In [20]:
len(group_by_uniprot)

109

How many are enzymes? We can use "ec_number" to work out how many have E.C. numbers

In [23]:
uniprot_with_ec = group_by_uniprot.query('ec_number != 0')

In [24]:
len(uniprot_with_ec)

2
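The filter works because `count()` only counts non-missing values, so an accession with no E.C. number in any of its rows gets an `ec_number` count of 0. The sketch below illustrates this on made-up rows with placeholder accessions, and shows an alternative that skips the grouped table by dropping rows without an E.C. number and counting distinct accessions.

```python
import pandas as pd

# Made-up rows with placeholder accessions; ACC2 has no E.C. number.
toy = pd.DataFrame(
    {
        "uniprot_accession_best": ["ACC1", "ACC1", "ACC2", "ACC3"],
        "pdb_id": ["e1", "e2", "e3", "e4"],
        "ec_number": ["1.13.12.7", "1.13.12.7", None, "1.13.12.7"],
    }
)

# count() ignores missing values, so ACC2 gets an ec_number count of 0.
counts = toy.groupby("uniprot_accession_best").count()
with_ec = counts.query("ec_number != 0")

# Alternative: drop rows without an E.C. number, then count accessions.
n_direct = toy.dropna(subset=["ec_number"])["uniprot_accession_best"].nunique()

print(len(with_ec), n_direct)
```

Both approaches should agree, provided missing E.C. numbers are stored as missing values rather than as empty strings.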