# PDBe API Training

### PDBe search

This interactive Python notebook will guide you through programmatically accessing Protein Data Bank in Europe (PDBe)
data using our REST API.

The REST API is a programmatic way to obtain information from the PDB and EMDB. You can access details about:

* sample
* experiment
* models
* compounds
* cross-references
* publications
* quality
* assemblies
* search
* and more

For more information, visit https://pdbe.org/api


This tutorial will guide you through searching PDBe programmatically.


First we will import the code which will do the work.
Run the cell below by pressing the green play button.

In [5]:
from pprint import pprint
from solrq import Q, Range
import sys
sys.path.insert(0,'..') # to ensure the below import works in all Jupyter notebooks
from python_modules.api_modules import run_search, pandas_dataset, pandas_count, pandas_plot, pandas_plot_multi_groupby

Now we are ready to run a search against the PDBe API for entries containing Dihydrofolate reductase. This will return a list of results, limited to 10 to start with.

A list of search terms is available at:
https://www.ebi.ac.uk/pdbe/api/doc/search.html

We are going to search for the molecule name "Dihydrofolate reductase" in the PDB.

The search terms are defined using a class called Q, imported from the solrq package.

We have imported a function called "run_search" that will do the search for us.
The search is limited to 10 results, and we will print "finished" at the end to show that it is complete.

In [6]:
print('1st search - limited to 10 results')

search_terms = Q(molecule_name='Dihydrofolate reductase')

first_results = run_search(search_terms)
print('finished')

1st search - limited to 10 results
Number of results for molecule_name:Dihydrofolate\ reductase: 10
finished
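
For orientation, here is a minimal sketch of the kind of request a helper such as "run_search" could send. It is not the implementation of run_search: the endpoint address, the function name build_search_url and the choice of parameters are assumptions made for illustration. The sketch only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Assumed search endpoint; check https://pdbe.org/api for the current address.
SEARCH_URL = "https://www.ebi.ac.uk/pdbe/search/pdb/select"


def build_search_url(field, value, rows=10):
    """Build a Solr-style search URL for a single field:value term (hypothetical helper)."""
    # Escape spaces so the value is treated as one term, matching the
    # query text echoed in the output above.
    escaped = value.replace(" ", "\\ ")
    params = {"q": f"{field}:{escaped}", "wt": "json", "rows": rows}
    return f"{SEARCH_URL}?{urlencode(params)}"


url = build_search_url("molecule_name", "Dihydrofolate reductase")
print(url)
```

The "rows" parameter is what keeps the result list short. Raising it returns more results at the cost of a slower search.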


What if we try to search for something that doesn't exist?

In [None]:
print('Getting the search wrong')

search_terms = Q(something_that_does_not_exist="Dihydrofolate reductase")
false_results = run_search(search_terms)

In [None]:
search_terms = Q(molecule_name="bob")
false_results2 = run_search(search_terms)

Or what if we define our search terms incorrectly? (This will fail.)

In [None]:
search_terms = Q('bob')
false_results3 = run_search(search_terms)

We will add organism_name of Human to the query to limit the results to only return those that are structures of Human Dihydrofolate reductase.

In [None]:
print('2nd search - two terms together')
search_terms = Q(organism_name="Human",molecule_name="Dihydrofolate reductase")
second_results = run_search(search_terms)
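
To make the combined query concrete, the sketch below builds a query string by hand from two terms. It assumes the terms are joined with AND, so that both must match. The helper names escape and and_query are hypothetical, and the exact string produced by Q (for example the order of the terms) may differ.

```python
def escape(value):
    """Escape spaces so a multi-word value is treated as a single term."""
    return str(value).replace(" ", "\\ ")


def and_query(**terms):
    """Join field:value pairs with AND so that every term must match."""
    return " AND ".join(f"{field}:{escape(value)}" for field, value in terms.items())


query = and_query(organism_name="Human", molecule_name="Dihydrofolate reductase")
print(query)
```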


For more complicated queries, have a look at the solrq documentation:
https://solrq.readthedocs.io/en/latest/index.html

How did we know which search terms to use?

To find out, we will look at the results of the last search.

We will look at the first result, i.e. second_results[0], with the command below.

We are going to use "pprint" (pretty print) rather than "print" to make the result easier to read.

All of the "keys" on the left side of the results can be used as a search term.

In [None]:
pprint(second_results[0])

Terms prefixed with q_ and t_ are for our internal use, so we can exclude them.
The following command builds the list of available search terms and then shows how many there are.

In [None]:
keys_without_q = [q for q in second_results[0].keys() if not q.startswith(('q_', 't_'))]
print('There are {} available search terms'.format(len(keys_without_q)))

We can then print out the terms we can use.

In [None]:
pprint(keys_without_q)

As you can see we get lots of data back about the individual molecule we have searched for and the PDB entries
in which it is contained.

We can get the PDB ID and experimental method for this first result as follows.

In [None]:
print(second_results[0].get('pdb_id'))
print(second_results[0].get('experimental_method'))

Note that the experimental method is a list, as there can be more than one experimental method per entry.
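
Because the value is a list, it usually needs a little handling before it can be printed or compared. The sketch below uses a hypothetical record with made-up placeholder values, not a real search result, to show one way to do this.

```python
# Hypothetical record with made-up placeholder values, shaped like one search result.
record = {
    "pdb_id": "0xyz",
    "experimental_method": ["X-ray diffraction", "Neutron Diffraction"],
}

# .get() returns None for a missing key, so fall back to an empty list.
methods = record.get("experimental_method") or []

# Join the list into a single string for display or for use as a table cell.
method_label = ", ".join(methods)

# More than one method means the entry was solved with a hybrid approach.
is_hybrid = len(methods) > 1

print(record.get("pdb_id"), method_label, is_hybrid)
```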

There are too many different terms to look through, so we can use a filter to restrict the results
to only the information we want, which makes them easier to read.

In [None]:
print('3rd search')
search_terms = Q(molecule_name="Dihydrofolate reductase",organism_name="Human")
filter_terms = ['pdb_id', 'experimental_method']
third_results = run_search(search_terms, filter_terms)
pprint(third_results)
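
As a rough illustration of how a filter can work, Solr-style search services accept an "fl" (field list) parameter naming the fields to return. Whether run_search passes filter_terms in this way is an assumption; build_params below is a hypothetical helper, not part of the imported module.

```python
from urllib.parse import urlencode


def build_params(query, filter_terms=None, rows=10):
    """Return Solr-style request parameters for a query and an optional field list."""
    params = {"q": query, "wt": "json", "rows": rows}
    if filter_terms:
        # "fl" takes a comma-separated list of the fields to return.
        params["fl"] = ",".join(filter_terms)
    return params


params = build_params(
    "molecule_name:Dihydrofolate\\ reductase AND organism_name:Human",
    filter_terms=["pdb_id", "experimental_method"],
)
print(urlencode(params))
```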

We are still restricting the number of entries to 10 so that we get the results quickly.

We will now increase the number of rows to 1000. Depending on the search, we might get fewer than 1000 results back.

In [None]:
print('Search with 1000 rows')
search_terms = Q(molecule_name="Dihydrofolate reductase",organism_name="Human")
filter_terms = ['pdb_id', 'experimental_method', 'release_year']

third_results_more_rows = run_search(search_terms, filter_terms, number_of_rows=1000)
pprint(third_results_more_rows)

We are going to use a Python package called Pandas to help us analyse and visualise the results.

In [None]:
df = pandas_dataset(list_of_results=third_results_more_rows)
print(df)

We can save the results to a CSV file, which we can load into Excel.

In [None]:
df.to_csv("search_results.csv")

We can use this dataset to count how many PDB IDs there are for each experimental method.
This groups PDB IDs by experimental method and then counts the number of unique PDB IDs per method.

In [None]:
pandas_count(df=df,
             column_to_group_by='experimental_method')
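
The group-then-count-unique step can be sketched without Pandas as follows. The rows are made-up placeholders, and whether pandas_count works exactly like this internally is an assumption. In Pandas itself, a comparable result could be obtained with df.groupby('experimental_method')['pdb_id'].nunique().

```python
from collections import defaultdict

# Made-up placeholder rows; the same entry can appear in more than one row.
rows = [
    {"pdb_id": "0aaa", "experimental_method": "X-ray diffraction"},
    {"pdb_id": "0aaa", "experimental_method": "X-ray diffraction"},
    {"pdb_id": "0bbb", "experimental_method": "X-ray diffraction"},
    {"pdb_id": "0ccc", "experimental_method": "Solution NMR"},
]

# Group: collect the set of PDB IDs seen for each method (sets drop duplicates).
ids_by_method = defaultdict(set)
for row in rows:
    ids_by_method[row["experimental_method"]].add(row["pdb_id"])

# Count: the size of each set is the number of unique PDB IDs per method.
counts = {method: len(ids) for method, ids in ids_by_method.items()}
print(counts)
```

Counting unique IDs rather than rows matters because one entry can contribute several rows to the results.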

We can also plot the results.

In [None]:
pandas_plot(df=df,
            column_to_group_by='experimental_method',
            graph_type='bar'
            )

Or we can plot the results per release year.

In [None]:
pandas_plot(df=df,
            column_to_group_by='release_year',
            graph_type='bar'
            )

Maybe a line plot makes more sense here.

In [None]:
pandas_plot(df=df,
            column_to_group_by='release_year',
            graph_type='line'
            )

Maybe we've heard that Electron Microscopy is taking over and we want to see if this is true.

We will filter out all hybrid methods.

In [None]:
search_terms = Q(release_year=Range(2000, 2019))
filter_results = ['experimental_method','release_year', 'pdb_id']
results = run_search(search_terms, filter_results, number_of_rows=100)

df = pandas_dataset(results)

# filter out all hybrid methods
df = df[~df['experimental_method'].str.contains(',')]
pandas_plot_multi_groupby(df, 'release_year', 'experimental_method')
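
The hybrid filter keeps only rows whose method string contains no comma. The sketch below shows the same idea on made-up placeholder rows using plain Python, followed by a count per year and method, which is the kind of quantity such a plot displays. How pandas_plot_multi_groupby computes its values internally is an assumption.

```python
from collections import Counter

# Made-up placeholder rows; a comma in the method string marks a hybrid entry.
rows = [
    {"pdb_id": "0aaa", "release_year": 2001, "experimental_method": "X-ray diffraction"},
    {"pdb_id": "0bbb", "release_year": 2001, "experimental_method": "Solution NMR"},
    {"pdb_id": "0ccc", "release_year": 2002,
     "experimental_method": "X-ray diffraction, Neutron Diffraction"},
    {"pdb_id": "0ddd", "release_year": 2002, "experimental_method": "Electron Microscopy"},
]

# Keep only rows whose method string has no comma (single-method entries).
single_method = [row for row in rows if "," not in row["experimental_method"]]

# Count rows per (year, method) pair.
per_year_method = Counter(
    (row["release_year"], row["experimental_method"]) for row in single_method
)
print(per_year_method)
```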

To get the full list of results, we need to increase the number of rows we ask for.

The line using str.contains(',') filters out all hybrid methods, which makes the graph easier to read.

In [None]:
results = run_search(search_terms, filter_results, number_of_rows=100000)
df = pandas_dataset(results)
# filter out all hybrid methods
df = df[~df['experimental_method'].str.contains(',')]
pandas_plot_multi_groupby(df, 'release_year', 'experimental_method')


Some data is only available through the search API and not the web interface.
An example of this is information about antibodies.  

In [None]:
search_terms = Q(antibody_flag='Y')
filter_terms = ['antibody_flag', 'antibody_name', 'antibody_species', 'pdb_id']
api_only_results1 = run_search(search_terms, filter_terms=filter_terms, number_of_rows=1000000)
print(len(api_only_results1))

In [None]:
df = pandas_dataset(api_only_results1)
print(df)
ds = df.groupby('pdb_id').count()
print(len(ds))
#ds = df.groupby('antibody_species').count().sort_values('antibody_flag', ascending=False)
#print(ds)
#ds.to_csv('output.csv')
