# PDBe API Training

### PDBe search

This interactive Python notebook will guide you through programmatically accessing Protein Data Bank in Europe (PDBe)
data using our REST API.

The REST API is a programmatic way to obtain information from the PDB and EMDB. You can access details about:

* sample
* experiment
* models
* compounds
* cross-references
* publications
* quality
* assemblies
* search

and more...

For more information, visit https://pdbe.org/api


This tutorial will guide you through searching PDBe programmatically.


First we will import the code which will do the work.
Run the cell below by pressing the green play button.

In [None]:
from pprint import pprint
from solrq import Q, Range
import sys
sys.path.insert(0,'..')
from tutorial_utilities.api_modules import (
    pandas_dataset, 
    pandas_count, 
    pandas_plot, 
    pandas_plot_multi_groupby
)

Now we are ready to run a search against the PDBe search API. Our aim is to find entries in the PDB containing Acetylcholinesterase from *Homo sapiens*. Each search returns a list of results, limited to 10 to start with.

A list of search terms is available at:
https://www.ebi.ac.uk/pdbe/api/doc/search.html

We are going to search for the molecule name "Acetylcholinesterase" in the PDB.

The search terms are defined using a class called Q, imported from the solrq package.

The function "run_search", defined in the cell below, will do the search for us.
By default it is limited to 10 results, and we will print "Finished" at the end to show that the search is complete.

In [None]:
def run_search(search_terms, filter_terms=None, number_of_rows=10, **kwargs):
    """
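    Run the search with a set of search terms.

    Relies on the helper functions format_search_terms_post and make_request_post
    being available in the notebook namespace.

    :param Q search_terms: solrq Q object holding the search terms
    :param list filter_terms: list of fields to return for each result
    :param int number_of_rows: number of search rows to return
    :return list: list of results (empty if nothing was found)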
    """
    search_params = format_search_terms_post(search_terms=search_terms, filter_terms=filter_terms)
    if search_params:
        response = make_request_post(search_dict=search_params, number_of_rows=number_of_rows)
        if response:
            results = response.get('response', {}).get('docs', [])
            print('Number of results for {}: {}'.format(search_terms, len(results)))
            return results

    print('No results')
    return []
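
The helper functions called inside run_search are not shown in this notebook. As a rough sketch of the kind of request they might build, the cell below assembles Solr-style parameters using only the standard library. The endpoint URL, the parameter names (`q`, `fl`, `rows`, `wt`) and the function names `build_search_params` and `post_search` are assumptions made for illustration; they are not the tutorial's actual implementation.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the PDBe Solr search service; check https://pdbe.org/api before relying on it
SEARCH_URL = "https://www.ebi.ac.uk/pdbe/search/pdb/select"


def build_search_params(query, filter_terms=None, number_of_rows=10):
    """Return a dict of Solr-style parameters for a search (hypothetical helper)."""
    params = {"q": str(query), "rows": int(number_of_rows), "wt": "json"}
    if filter_terms:
        # 'fl' restricts the fields returned for each document
        params["fl"] = ",".join(filter_terms)
    return params


def post_search(params, url=SEARCH_URL, timeout=30):
    """POST the parameters and return the list of result documents (hypothetical helper)."""
    data = urllib.parse.urlencode(params).encode("utf-8")
    request = urllib.request.Request(url, data=data, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload.get("response", {}).get("docs", [])


# Building the parameters needs no network access
params = build_search_params("molecule_name:Acetylcholinesterase", ["pdb_id", "resolution"])
print(params)
```

Keeping the parameter building separate from the network call makes it easy to inspect what would be sent before any request is made.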

In [None]:
# Create the Solr search object
search_terms = Q(molecule_name='Acetylcholinesterase')

# Run the search
first_results = run_search(search_terms)
print("Finished")

What if we try to search for something that doesn't exist?

In [None]:
# The Solr search object can be created with the incorrect keyterm, 'bob'
search_terms = Q(bob="Acetylcholinesterase")

# Run the erroneous search
false_results = run_search(search_terms)
print("Finished")

In [None]:
# The keyterm is now correct, but the value is incorrect
search_terms = Q(molecule_name="bob")

# Run the erroneous search and see what you get
empty_results = run_search(search_terms)

empty_results

Or what if we define our search terms incorrectly? (This will fail.)

In [None]:
search_terms = Q('bob')
false_results3 = run_search(search_terms)

We will add organism_name of *Homo sapiens* to the query to limit the results to only return those that are structures of the human Acetylcholinesterase.

In [None]:
print('2nd search - two terms together')
search_terms = Q(organism_name='Homo sapiens', molecule_name='Acetylcholinesterase')
second_results = run_search(search_terms)
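
To make the two-term search more concrete, here is a rough sketch of the kind of Solr query string that keyword terms like these correspond to: each `field:value` pair is joined with `AND`. The functions `escape_value` and `build_query` are hypothetical stand-ins written for illustration. They are not how solrq is implemented, and their escaping may differ from solrq's.

```python
import re

# Characters with special meaning in Solr/Lucene query syntax, plus whitespace
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|\s])')


def escape_value(value):
    """Backslash-escape special characters so the value is matched literally."""
    return _SPECIAL.sub(r"\\\1", str(value))


def build_query(**terms):
    """Join field:value pairs with AND, in alphabetical field order (hypothetical helper)."""
    parts = ["{}:{}".format(field, escape_value(value)) for field, value in sorted(terms.items())]
    return " AND ".join(parts)


query = build_query(organism_name="Homo sapiens", molecule_name="Acetylcholinesterase")
print(query)
```

Because both terms are joined with `AND`, a result has to match the molecule name and the organism name at the same time.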


For more complicated queries, have a look at the solrq documentation:
https://solrq.readthedocs.io/en/latest/index.html

How did we know which search terms to use?

To find out, we will look at the results of the last search, starting with the first result, i.e. second_results[0].

We are going to use "pprint" (pretty print) rather than "print" to make the result easier to read.

All of the "keys" on the left side of the results can be used as a search term.

In [None]:
pprint(second_results[0])

Terms prefixed with q_ and t_ are for our internal use, so we can exclude them.
The following command builds the list of available search terms and then shows how many search terms there are.

In [None]:
keys_without_q = [q for q in second_results[0].keys() if not q.startswith(('q_', 't_'))]
print('There are {} available search terms'.format(len(keys_without_q)))

Then we print out the terms we can use.

In [None]:
pprint(keys_without_q)
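
Since a misspelt key such as 'bob' does not give useful results, it can help to check a set of search terms against the keys of an example result before searching. The sketch below does this with plain Python. The function names `available_terms` and `unknown_terms` are hypothetical, and the example document uses made-up placeholder values rather than real search output.

```python
def available_terms(document):
    """Return the keys of one result, skipping the internal q_ and t_ fields."""
    return sorted(key for key in document if not key.startswith(("q_", "t_")))


def unknown_terms(requested, document):
    """Return the requested search terms that are not keys of the example document."""
    known = set(available_terms(document))
    return [term for term in requested if term not in known]


# Illustrative document with made-up values; a real result has many more keys
example = {"pdb_id": "xxxx", "molecule_name": ["example"], "q_pdb_id": "xxxx", "t_all": "example"}

print(unknown_terms(["molecule_name", "bob"], example))
```

A key that is missing from one particular result may still be a valid search term, so treat the output as a hint to double-check the spelling rather than a definitive answer.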

As you can see we get lots of data back about the individual molecule we have searched for and the PDB entries
in which it is contained.

We can get the PDB ID and structure resolution for this first result as follows.

In [None]:
print(second_results[0].get('pdb_id'))
print(second_results[0].get('resolution'))
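
The same `.get()` pattern extends from one result to the whole list. As a minimal sketch, the hypothetical function `best_resolution` below picks the result with the lowest resolution value (a lower value in ångströms means a higher-resolution structure) and skips results that have no resolution. The rows are placeholders, not real search results.

```python
def best_resolution(results):
    """Return (pdb_id, resolution) for the result with the lowest resolution value."""
    with_resolution = [row for row in results if row.get("resolution") is not None]
    if not with_resolution:
        return None
    best = min(with_resolution, key=lambda row: row["resolution"])
    return best.get("pdb_id"), best["resolution"]


# Placeholder rows; the third has no resolution, as can happen for some experimental methods
rows = [
    {"pdb_id": "xxx1", "resolution": 2.5},
    {"pdb_id": "xxx2", "resolution": 1.9},
    {"pdb_id": "xxx3"},
]
print(best_resolution(rows))
```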

There are too many different terms to look through, so we can use a filter to restrict the results
to only the information we want.

In [None]:
print('3rd search')
search_terms = Q(molecule_name="Acetylcholinesterase",organism_name="Homo sapiens")
filter_terms = ['pdb_id', 'resolution']
third_results = run_search(search_terms, filter_terms)
pprint(third_results)

We are still restricting the number of entries to 10 so that we get the results quickly.

We will now increase the number of rows to 1000. Depending on the search, we might get fewer than 1000 results back.

In [None]:
print('Project aims 1: Search all the structures of Human Acetylcholinesterase, Search with up to 1000 rows')
search_terms = Q(molecule_name="Acetylcholinesterase",organism_name="Homo sapiens")
filter_terms = ['pdb_id', 'resolution', 'release_year']
third_results_more_rows = run_search(search_terms,filter_terms, number_of_rows=1000)
pprint(third_results_more_rows)

We are going to use a Python package called pandas to help us analyse and visualise the results.

In [None]:
df = pandas_dataset(list_of_results=third_results_more_rows)
print(df)

We can save the results to a CSV file, which we can load into Excel.

In [None]:
df.to_csv("search_results_project_aims_1.csv")
print('Search results with structures of Human Acetylcholinesterase written in filename:search_results_project_aims_1.csv')
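
If pandas is not available, the standard library can write the same kind of file. The sketch below is an alternative under that assumption, not what `df.to_csv` does internally. The function name `results_to_csv` is hypothetical and the rows are placeholders.

```python
import csv
import io


def results_to_csv(results, columns):
    """Write a list of result dicts to CSV text, one row per result."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in results:
        # Keys missing from a result are written as empty cells
        writer.writerow({column: row.get(column, "") for column in columns})
    return buffer.getvalue()


# Placeholder rows, not real search results
rows = [{"pdb_id": "xxx1", "resolution": 2.0}, {"pdb_id": "xxx2"}]
csv_text = results_to_csv(rows, ["pdb_id", "resolution"])
print(csv_text)
```

The returned text can be written to disk with an ordinary `open(filename, "w")` call.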

We can use this to count how many PDB codes there are for each resolution.
This groups PDB IDs by resolution value and then counts the number of unique PDB IDs per resolution value.

In [None]:
pandas_count(df=df,
             column_to_group_by='resolution')
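
The internals of `pandas_count` are not shown here, but the grouping it is described as doing can be sketched in plain Python: collect the PDB IDs seen for each value of the grouping column, then count the distinct IDs. The function name `count_unique_ids` is hypothetical and the rows are placeholders.

```python
from collections import defaultdict


def count_unique_ids(results, group_key, id_key="pdb_id"):
    """Count distinct id_key values for each value of group_key."""
    groups = defaultdict(set)
    for row in results:
        if group_key in row and id_key in row:
            groups[row[group_key]].add(row[id_key])
    return {value: len(ids) for value, ids in sorted(groups.items())}


# Placeholder rows; the same PDB ID may appear in more than one row
rows = [
    {"pdb_id": "xxx1", "resolution": 2.0},
    {"pdb_id": "xxx1", "resolution": 2.0},
    {"pdb_id": "xxx2", "resolution": 2.0},
    {"pdb_id": "xxx3", "resolution": 2.5},
]
counts = count_unique_ids(rows, "resolution")
print(counts)
```

Using a set per group means a PDB ID that appears in several rows is only counted once.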

Or we can plot the results as a histogram.

In [None]:
pandas_plot(df=df,
            column_to_group_by='resolution',
            graph_type='hist'
            )

Or we can plot the results per release year.

In [None]:
pandas_plot(df=df,
            column_to_group_by='release_year',
            graph_type='bar'
            )

Maybe a line plot makes more sense here:

In [None]:
pandas_plot(df=df,
            column_to_group_by='release_year',
            graph_type='line'
            )
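
A line plot over release years is easier to read when years without any entries appear as zero rather than being skipped. The sketch below prepares such a series in plain Python, counting each PDB ID once. The function name `entries_per_year` is hypothetical, the rows are placeholders, and this is not necessarily how `pandas_plot` prepares its data.

```python
from collections import Counter


def entries_per_year(results, year_key="release_year", id_key="pdb_id"):
    """Return (year, count) pairs for every year in the range, counting each ID once."""
    year_by_id = {}
    for row in results:
        if year_key in row and id_key in row:
            # Keep the first year seen for each ID
            year_by_id.setdefault(row[id_key], int(row[year_key]))
    counts = Counter(year_by_id.values())
    if not counts:
        return []
    return [(year, counts.get(year, 0)) for year in range(min(counts), max(counts) + 1)]


# Placeholder rows, not real search results
rows = [
    {"pdb_id": "xxx1", "release_year": 2010},
    {"pdb_id": "xxx1", "release_year": 2010},
    {"pdb_id": "xxx2", "release_year": 2012},
]
series = entries_per_year(rows)
print(series)
```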

In [None]:
print('Project aims 2- Searching all the interacting macromolecules')
search_terms = Q(molecule_name="Acetylcholinesterase",organism_name="Homo sapiens")
filter_terms = ['pdb_id','interacting_uniprot_accession']
fourth_results = run_search(search_terms, filter_terms, number_of_rows = 1000)
#pprint(fourth_results)
df4 = pandas_dataset(list_of_results=fourth_results)
df4.to_csv("search_results_project_aims_2.csv")
print('Search results with interacting macromolecules written in filename:search_results_project_aims_2.csv')

In [None]:
print('Project aims 3- Searching all the interacting ligands')
search_terms = Q(molecule_name="Acetylcholinesterase",organism_name="Homo sapiens")
filter_terms = ['pdb_id','interacting_ligands']
fifth_results = run_search(search_terms, filter_terms, number_of_rows = 1000)
#pprint(fifth_results)
df5 = pandas_dataset(list_of_results=fifth_results)
df5.to_csv("search_results_project_aims_3.csv")
print('Search results with interacting ligands/small molecules written in filename:search_results_project_aims_3.csv')
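
Fields such as the interacting ligands can hold several values per result. Assuming such a field comes back as a list of strings for each result (an assumption, since the raw output is not shown here), the distinct values across all results can be collected as sketched below. The function name `unique_values` is hypothetical and the ligand names are placeholders.

```python
def unique_values(results, field):
    """Collect the distinct values of a field that may hold a list or a single value."""
    seen = set()
    for row in results:
        value = row.get(field)
        if value is None:
            continue
        # Treat a single value like a one-item list
        values = value if isinstance(value, (list, tuple)) else [value]
        seen.update(values)
    return sorted(seen)


# Placeholder rows, not real search results
rows = [
    {"pdb_id": "xxx1", "interacting_ligands": ["LIG1", "LIG2"]},
    {"pdb_id": "xxx2", "interacting_ligands": ["LIG2"]},
    {"pdb_id": "xxx3"},
]
ligands = unique_values(rows, "interacting_ligands")
print(ligands)
```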

Some data is only available through the search API and not the web interface.
An example of this is information about antibodies.  

In [None]:
search_terms = Q(antibody_flag='Y')
filter_terms = ['antibody_flag', 'antibody_name', 'antibody_species', 'pdb_id']
api_only_results1 = run_search(search_terms, filter_terms=filter_terms, number_of_rows=1000000)
print(len(api_only_results1))

In [None]:
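df = pandas_dataset(api_only_results1)
print(df)
# Group the rows by PDB ID; the number of groups is the number of distinct PDB entries
ds = df.groupby('pdb_id').count()
print(len(ds))
# Optional: count rows per antibody species instead, most common first
#ds = df.groupby('antibody_species').count().sort_values('antibody_flag', ascending=False)
#print(ds)
#ds.to_csv('output.csv')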
