# PDBe API Training

### Introduction

This interactive Python notebook will guide you through programmatically accessing Protein Data Bank in Europe (PDBe) data using our REST API.

The REST API is a programmatic way to obtain information from the PDB and EMDB archives. It allows you to access and filter the large amount of data held about the structures in these archives.

Many types of information about these structures are available through structured data categories. For example, you can access information about:
* sample details
* experimental setup
* model quality
* bound compounds
* assembly formation
* cross-references
* publications
* and much more...

For more information, visit https://pdbe.org/api



### Setup

First we will import the code required to search the API and plot the results.

Run the cell below by pressing the play button.

In [None]:
from pprint import pprint
from solrq import Q, Range
import sys
sys.path.insert(0,'..')
from tutorial_utilities.api_modules import (
    format_search_terms_post,
    make_request_post,
    pandas_dataset, 
    pandas_count, 
    pandas_plot, 
    pandas_plot_multi_groupby   
)

In [None]:
def run_search(search_terms, filter_terms=None, number_of_rows=10, **kwargs):
    """
    Run the search with a set of search terms
    :param str search_terms: string of search terms
    :param list filter_terms: list of terms to filter by
    :param int number_of_rows: number of search rows to return
    :return list: list of results
    """
    search_params = format_search_terms_post(search_terms=search_terms, filter_terms=filter_terms)
    if search_params:
        response = make_request_post(search_dict=search_params, number_of_rows=number_of_rows)
        if response:
            results = response.get('response', {}).get('docs', [])
            print('Number of results for {}: {}'.format(search_terms, len(results)))
            return results

    print('No results')
    return []
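
The nested `.get` calls in "run_search" assume that the API returns a Solr-style JSON body in which the matching documents sit under `response` and then `docs`. The sketch below reproduces that extraction on a hand-written dictionary. The `mock_response` content (including the placeholder IDs) and the `extract_docs` helper are illustrative assumptions, not real API output.

```python
# Hypothetical, hand-written stand-in for a Solr-style JSON response.
# The IDs and values are placeholders, not real PDB data.
mock_response = {
    "response": {
        "numFound": 2,
        "docs": [
            {"pdb_id": "xxx1", "resolution": 2.0},
            {"pdb_id": "xxx2", "resolution": 2.5},
        ],
    }
}


def extract_docs(response):
    """Return the list of documents, or an empty list if a level is missing."""
    return response.get("response", {}).get("docs", [])


docs = extract_docs(mock_response)
print(len(docs))
print(extract_docs({}))
```

Because each `.get` supplies a default, a missing key yields an empty list rather than raising a `KeyError`.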

### Initial testing

Now we are ready to run a search against the PDBe API for entries in the PDB containing Acetylcholinesterase from *Homo sapiens*.

A list of search terms is available at:
https://www.ebi.ac.uk/pdbe/api/doc/search.html

For this task we will search for the molecule name "Acetylcholinesterase" in the PDB.

To run this search we first need to set the query parameters using Q (short for query), which is imported from solrq. Once these have been set, we can use the "run_search" function that we defined above to perform the API query and return the results.

By default we have limited the length of the results to 10 rows. 

In [None]:
# Create the Solr search terms object
search_terms = Q(molecule_name='Acetylcholinesterase')

# Run the search
initial_results = run_search(search_terms)
print("Finished")

Not all queries will return valuable results.

What if we try to search for something that doesn't exist?

In [None]:
# The Solr search object can be created with the incorrect search term, 'bob'
search_terms = Q(bob="Acetylcholinesterase")

# Run the erroneous search
bad_results = run_search(search_terms)
print("Finished")

In [None]:
# The search term is now correct, but the value is incorrect
search_terms = Q(molecule_name="bob")

# Run the erroneous search and see what you get
empty_results = run_search(search_terms)
print("Finished")

What if we define the search terms incorrectly? (Hint: This will fail!)

In [None]:
search_terms = Q('bob')
bad_results = run_search(search_terms)
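
When a cell like the one above raises an error, it can be convenient to catch the exception so that the rest of the notebook keeps running. The sketch below shows one way to do that. Both `safe_call` and `failing_search` are hypothetical helpers written for this illustration and are not part of the tutorial utilities.

```python
def safe_call(func, *args, **kwargs):
    """Call func and return (result, None), or (None, error) if it raises."""
    try:
        return func(*args, **kwargs), None
    except Exception as error:  # deliberately broad: meant for interactive exploration
        return None, error


def failing_search(terms):
    # Hypothetical stand-in for a search that rejects malformed terms.
    raise ValueError(f"malformed search terms: {terms!r}")


result, error = safe_call(failing_search, "bob")
print(result)
print(type(error).__name__, error)
```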

### Refining the query

We can make the search results more specific by adding additional query parameters.

Here we will add the organism_name "Homo sapiens" to the query so that only structures of human Acetylcholinesterase are returned.

In [None]:
search_terms = Q(organism_name='Homo sapiens', molecule_name='Acetylcholinesterase')
refined_results = run_search(search_terms)
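
To make it clearer what combining two parameters means, the sketch below joins field and value pairs into a single Solr-style query string with `AND`. The `build_query` helper is a hypothetical illustration: the exact string that solrq produces may differ, for example in how it handles spaces and special characters.

```python
def build_query(**fields):
    """Join field:"value" pairs with AND (illustrative, not solrq's exact output)."""
    parts = [f'{name}:"{value}"' for name, value in fields.items()]
    return " AND ".join(parts)


query = build_query(
    molecule_name="Acetylcholinesterase",
    organism_name="Homo sapiens",
)
print(query)
```

Every condition joined with `AND` must match, so each added parameter can only narrow the set of results.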

How did we know which search terms to use?

There are many parameters that can be used to filter the results of a search. Finding useful data requires an understanding of what data is available.

Exploring the available data is an essential part of the process. All the search terms can be found here:

https://www.ebi.ac.uk/pdbe/api/doc/search.html

For more complicated queries, have a look at the solrq documentation:

https://solrq.readthedocs.io/en/latest/index.html


### Exploring the results

Once a set of results has been obtained, it can be explored in more detail.

We will now look at individual protein structures returned in the refined_results.

The following code returns all the data associated with the first protein structure found in the refined_results. It uses "pprint" (pretty print) to make the results easier to read.

All of the "keys" on the left side of the results can be used as search terms.

In [None]:
pprint(refined_results[0])

There are many terms with the prefixes "q_" and "t_". These are only used for internal processes in PDBe and so can be ignored. 

Below we will find all the search terms that might be useful when querying the data (excludes the "q_" and "t_" search terms).

In [None]:
useful_search_terms = []
for term in refined_results[0].keys():
    if not term.startswith('q_') and not term.startswith('t_'):
        useful_search_terms.append(term)
           
print(f'There are {len(useful_search_terms)} available search terms (excluding "q_" and "t_" terms)')

We can then print out the terms we can use:

In [None]:
pprint(useful_search_terms)
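
The same prefix filter can be written more compactly, because `str.startswith` accepts a tuple of prefixes. The sketch below runs on a small hand-written record; the keys and values in `mock_record` are made up for illustration and are not real API output.

```python
# Prefixes that mark internal fields, as described above.
INTERNAL_PREFIXES = ("q_", "t_")

# Hypothetical record with placeholder keys and values.
mock_record = {
    "pdb_id": "xxx1",
    "resolution": 2.0,
    "q_pdb_id": "xxx1",
    "t_molecule_name": "example",
}

# startswith returns True if the key begins with any prefix in the tuple.
useful = [key for key in mock_record if not key.startswith(INTERNAL_PREFIXES)]
print(useful)
```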

As you can see, we get lots of data back about the individual molecule we searched for and the PDB entries in which it is contained.

For example, we can get the PDB ID and structure resolution for this first result as follows:

In [None]:
print(f"PDB ID:     {refined_results[0].get('pdb_id')}")
print(f"Resolution: {refined_results[0].get('resolution')}")

### Filtering the output data

There are too many different terms to look through, so we can use a filter to restrict the results to only the information we want.

In [None]:
search_terms = Q(molecule_name="Acetylcholinesterase", organism_name="Homo sapiens")
filter_terms = ['pdb_id', 'resolution']
resolution_results = run_search(search_terms, filter_terms)
pprint(resolution_results)
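
The effect of a filter can be pictured as keeping only the chosen keys from each result. The sketch below does this on the client side with hand-written records. The `select_fields` helper and the placeholder data are assumptions for illustration; in the tutorial, the filter terms are passed to "run_search" rather than applied afterwards.

```python
def select_fields(records, fields):
    """Keep only the requested fields from each record (None if absent)."""
    return [{field: record.get(field) for field in fields} for record in records]


# Hypothetical records with placeholder values.
mock_records = [
    {"pdb_id": "xxx1", "resolution": 2.0, "release_year": 2010},
    {"pdb_id": "xxx2", "release_year": 2015},
]

filtered = select_fields(mock_records, ["pdb_id", "resolution"])
print(filtered)
```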

### Reformatting the results

While we were exploring the data we restricted the number of entries in the output to 10 rows. This allows us to get the results more quickly. Once we have refined our query parameters we can increase this limit.

Now that we have a refined query, let's increase the limit to 1000 rows. Depending on the search, we might get fewer than 1000 results back.

**--This fulfils Project Aim 1A--**

In [None]:
search_terms = Q(molecule_name="Acetylcholinesterase", organism_name="Homo sapiens")
filter_terms = ['pdb_id', 'resolution', 'release_year']
project_aim_1a_results = run_search(search_terms, filter_terms, number_of_rows=1000)
pprint(project_aim_1a_results)

We are going to use a Python package called pandas to help us analyse and visualise the results.

In [None]:
df_1a_results = pandas_dataset(list_of_results=project_aim_1a_results)
print(df_1a_results)

We can save the results to a CSV file, which we can then load into Excel.

In [None]:
df_1a_results.to_csv("search_results_project_aims_1a.csv")
print('Search results written to file: search_results_project_aims_1a.csv')
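
To show what ends up in a CSV file, the sketch below writes a couple of hand-written records with Python's standard `csv` module and reads them back. An in-memory buffer stands in for the file so that nothing is left on disk, and the records are placeholders rather than real search results.

```python
import csv
import io

# Hypothetical records with placeholder values.
records = [
    {"pdb_id": "xxx1", "resolution": 2.0, "release_year": 2010},
    {"pdb_id": "xxx2", "resolution": 2.5, "release_year": 2015},
]
fieldnames = ["pdb_id", "resolution", "release_year"]

# Write a header row followed by one row per record.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back; note that every value comes back as a string.
rows_read_back = list(csv.DictReader(io.StringIO(csv_text)))
print(rows_read_back[0])
```

Note that pandas `to_csv` also writes the DataFrame index as a first column unless `index=False` is passed.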

### Analysing and plotting the results

We can use this DataFrame to count how many PDB IDs there are for each resolution.
The function below groups PDB IDs by resolution value and then counts the number of unique PDB IDs per resolution.

In [None]:
pandas_count(df=df_1a_results,
             column_to_group_by='resolution')
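
The idea behind this count can be sketched without pandas. The example below groups hand-written records by a chosen key and counts the distinct PDB IDs in each group. It assumes, as described above, that unique IDs are what is counted; `count_unique_ids` and the placeholder records are illustrative and are not the implementation of "pandas_count".

```python
from collections import defaultdict


def count_unique_ids(records, group_key, id_key="pdb_id"):
    """Count distinct id_key values for each value of group_key."""
    groups = defaultdict(set)
    for record in records:
        value = record.get(group_key)
        if value is not None:  # skip records that lack the grouping field
            groups[value].add(record.get(id_key))
    return {value: len(ids) for value, ids in sorted(groups.items())}


# Hypothetical records; the same entry can appear more than once.
mock_records = [
    {"pdb_id": "xxx1", "resolution": 2.0},
    {"pdb_id": "xxx1", "resolution": 2.0},
    {"pdb_id": "xxx2", "resolution": 2.0},
    {"pdb_id": "xxx3", "resolution": 2.5},
    {"pdb_id": "xxx4"},
]

counts = count_unique_ids(mock_records, "resolution")
print(counts)
```

Using a set per group means a repeated ID is counted only once.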

We can then plot these results in a variety of ways using pandas:

In [None]:
# Plot resolution as a histogram
pandas_plot(df=df_1a_results,
            column_to_group_by='resolution',
            graph_type='hist'
            )
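
A histogram works by sorting values into bins of equal width and counting how many fall into each bin. The sketch below shows that step on a handful of made-up resolution values. The `bin_values` helper and the 0.5 bin width are assumptions for illustration and do not reflect how "pandas_plot" chooses its bins.

```python
import math
from collections import Counter


def bin_values(values, bin_width=0.5):
    """Map each value to the lower edge of its bin and count values per bin."""
    counts = Counter(math.floor(value / bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))


# Made-up resolution values in angstroms.
resolutions = [1.9, 2.0, 2.1, 2.4, 2.6, 3.2]
binned = bin_values(resolutions)
print(binned)
```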

In [None]:
# Plot release year as a bar chart
pandas_plot(df=df_1a_results,
            column_to_group_by='release_year',
            graph_type='bar'
            )

A line plot might make more sense for this data:

In [None]:
# Plot release year as a line chart
pandas_plot(df=df_1a_results,
            column_to_group_by='release_year',
            graph_type='line'
            )

### Searching for interacting macromolecules

We can now use what we have learnt to obtain the data for the other project aims:

**--The following search fulfils Project Aim 1B--**

In [None]:
# Obtain Project Aim 1B results
search_terms = Q(molecule_name="Acetylcholinesterase", organism_name="Homo sapiens")
filter_terms = ['pdb_id', 'interacting_uniprot_accession']
project_aim_1b_results = run_search(search_terms, filter_terms, number_of_rows=1000)

In [None]:
# Reformat and write results to file
df_1b_results = pandas_dataset(list_of_results=project_aim_1b_results)
df_1b_results.to_csv("search_results_project_aims_1b.csv")
print('Search results written to file: search_results_project_aims_1b.csv')

In [None]:
# Print results
pprint(project_aim_1b_results)

**--The following search fulfils Project Aim 1C--**

In [None]:
# Obtain Project Aim 1C results
search_terms = Q(molecule_name="Acetylcholinesterase", organism_name="Homo sapiens")
filter_terms = ['pdb_id', 'interacting_ligands']
project_aim_1c_results = run_search(search_terms, filter_terms, number_of_rows=1000)

In [None]:
# Reformat and write results to file
df_1c_results = pandas_dataset(list_of_results=project_aim_1c_results)
df_1c_results.to_csv("search_results_project_aims_1c.csv")
print('Search results written to file: search_results_project_aims_1c.csv')

In [None]:
# Print results
pprint(project_aim_1c_results)
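
Fields such as the interacting ligands may hold several values per result. Assuming such a field is returned as a list of strings, the sketch below counts how many records mention each value. The `count_list_field` helper and the placeholder ligand names are hypothetical, and the real field format should be checked against the printed results.

```python
from collections import Counter


def count_list_field(records, field):
    """Count, for each value of a list-valued field, the records that contain it."""
    counts = Counter()
    for record in records:
        values = record.get(field) or []
        if isinstance(values, str):  # tolerate a single string instead of a list
            values = [values]
        counts.update(set(values))  # a value counts once per record
    return counts


# Hypothetical records with placeholder ligand names.
mock_records = [
    {"pdb_id": "xxx1", "interacting_ligands": ["LIG1", "LIG2"]},
    {"pdb_id": "xxx2", "interacting_ligands": ["LIG1", "LIG1"]},
    {"pdb_id": "xxx3"},
]

ligand_counts = count_list_field(mock_records, "interacting_ligands")
print(ligand_counts.most_common())
```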

### Optional extras

Some data is only available through the search API and not the web interface.

One example of this is the additional information made available about antibodies:

In [None]:
search_terms = Q(antibody_flag='Y')
filter_terms = ['antibody_name', 'antibody_species', 'pdb_id']
antibody_results = run_search(search_terms, filter_terms=filter_terms, number_of_rows=1000000)

We can explore this data by grouping the column values:

In [None]:
df_antibody_results = pandas_dataset(antibody_results)
print(df_antibody_results)

# Count number of entries containing an antibody
ds_antibody_entries = df_antibody_results.groupby('pdb_id').count()
print(
f"""
Number of antibody entries: {len(ds_antibody_entries)}
"""
)

# Count all the species which an antibody has been obtained from 
ds_antibody_species = df_antibody_results.groupby('antibody_species').count()
print(
f"""
Antibody entries broken down by species: 
{ds_antibody_species['antibody_name']}
"""
)