# PDBe API Training

## Introduction

This interactive Python notebook will guide you through programmatically accessing Protein Data Bank in Europe (PDBe) data using our REST API.

The REST API is a programmatic way to obtain information from the PDB and EMDB archives. It allows you to access and filter the vast amount of data held about the structures stored in these archives.

Many types of information about the structures stored are made available through structured data categories. For example, you can access information about:
* sample details
* experimental setup
* model quality
* bound compounds
* assembly formation
* cross-references
* publications
* and much more...

For more information, visit https://pdbe.org/api

If you are brand new to Python (or programming in general), two notebooks covering the basics can be completed in the `0_python_introduction` folder in this tutorial.

---
---

## Setup and Python Fundamentals

First we will import the libraries we need for searching the PDBe via an API. The PDBe's 
API search infrastructure runs on Solr, which can be queried using a dedicated client or 
with the widely used `requests` library, where searches can be 'posted' to the API. 
Here, we will use a dedicated client called [`solrq`](https://pypi.org/project/solrq/), 
which will format our search terms into a Solr-compliant format. There are many other 
Solr clients to choose from, but they should all return the same data from the PDBe. 
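
To give a feel for what a "Solr-compliant format" looks like, the sketch below builds a 
simple query string by hand. `build_solr_query` is a hypothetical helper written purely 
for illustration; it is not part of `solrq`, and `solrq` may escape values differently 
(this sketch simply quotes each value as a phrase).

```python
def build_solr_query(**fields):
    """Join field:"value" pairs with AND (hypothetical helper, not solrq)."""
    parts = []
    for field, value in fields.items():
        # Escape backslashes and double quotes so the value stays one phrase
        escaped = str(value).replace('\\', '\\\\').replace('"', '\\"')
        parts.append(f'{field}:"{escaped}"')
    return ' AND '.join(parts)


query = build_solr_query(
    molecule_name='Acetylcholinesterase',
    organism_name='Homo sapiens',
)
print(query)
# prints: molecule_name:"Acetylcholinesterase" AND organism_name:"Homo sapiens"
```

Using a client such as `solrq` saves us from handling this quoting and escaping ourselves.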

If you're new to Python and want to learn how to use the `requests` library directly, 
head to the Jupyter Notebooks in the `extra_api_tutorials` folder. 

Run the cell below by pressing the play button or `Ctrl+Enter`.

In [None]:
from solrq import Q

Next, let's import some additional libraries that we'll need for parsing data and 
sending our API query off to the PDBe. 

Run the cell below by pressing the play button or `Ctrl+Enter`.

In [None]:
from pprint import pprint
import sys
import requests
sys.path.insert(0, '..')

You now have access to additional functions and objects that were not loaded into your 
code by default. In other words, you have pulled code from the `sys`, `pprint` and 
`requests` modules into this file, saving you from re-writing it yourself. 

Nevertheless, you can write your own functions in Python using the `def` keyword. Below 
is an example of a new function called `make_request_post`, written specifically for 
this notebook, that sends off a query to the PDBe's Solr API and retrieves data.

Run the cell below by pressing the play button or `Ctrl+Enter`.

In [None]:
BASE_URL = "https://www.ebi.ac.uk/pdbe/"  # the beginning of the URL for PDBe's API.
SEARCH_URL = BASE_URL + 'search/pdb/select?'  # the rest of the URL used for PDBe's search API.

def make_request_post(search_dict, number_of_rows=10):
    """
    Makes a POST request to the PDBe search API
    """
    
    if 'rows' not in search_dict:
        search_dict['rows'] = number_of_rows

    search_dict['wt'] = 'json'
    response = requests.post(SEARCH_URL, data=search_dict)

    if response.status_code == 200:
        return response.json()
    
    else:
        print(f"[No data retrieved - {response.status_code}] {response.text}")

    return {}

#### _Explaining functions_

Functions are an essential component of programming languages and Python is no 
exception. They allow us to create a block of instructions that can be called and 
enacted somewhere else (maybe many places) without us having to re-write every line 
again. Consider the `print` function; it's a block of code that displays variables on a 
terminal that you can call anywhere without needing to think about how it displays that 
variable. By using `def`, we achieve a similar result by writing our own code and then 
calling it anywhere in our script. 

Functions might accept inputs, called 'arguments', whenever they are called. The 
function above accepts two arguments, `search_dict` and `number_of_rows`. By default, we 
have set `number_of_rows` to `10`, but this can be changed when the function is called. 

Functions can also `return` values, which we can set to variables. Below is an abstract 
example:


```python
# Sets a new variable to the result of 2x2x2
eight = cube(2)
```
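
As a concrete version of the abstract example, here is a minimal sketch of how such a 
function might be defined. Both `cube` and `power` are hypothetical functions written 
only for illustration; `power` shows how a default argument value behaves.

```python
def cube(number):
    """Return the cube of number (number x number x number)."""
    return number ** 3


def power(number, exponent=2):
    """Raise number to exponent; exponent defaults to 2 when not given."""
    return number ** exponent


eight = cube(2)                       # 2 x 2 x 2
nine = power(3)                       # uses the default exponent of 2
twenty_seven = power(3, exponent=3)   # overrides the default

print(eight, nine, twenty_seven)  # prints: 8 9 27
```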

Now that we have defined a function to send a request to the PDBe's API, we need to 
format the fields and terms we intend to search with. This processing could be repeated 
every time we send off a query, but putting it in its own function saves us time and 
lines of code. 

Below is a function that formats a set of search terms into an appropriate data 
structure every time we need to query the PDBe API. To use this function elsewhere in 
the notebook, run the code block below to load it into memory. 

In [None]:
def format_search_terms_post(search_terms, filter_terms=None, **kwargs):
    """
    Formats the search terms for the PDBe API
    """

    # Variable to return
    return_variables = {'q': str(search_terms)}

    if filter_terms:
        fl = ','.join(filter_terms)
        return_variables['fl'] = fl

    for arg in kwargs:
        return_variables[arg] = kwargs[arg]

    return return_variables
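
To make the output of this formatting step more concrete, the sketch below shows roughly 
what a dictionary of this shape looks like once it has been form-encoded for a POST 
request. The dictionary values here are illustrative assumptions (the `q` string is 
written by hand rather than produced by `solrq`), and the standard library's `urlencode` 
is used only to approximate what `requests.post(..., data=...)` sends.

```python
from urllib.parse import urlencode

# A hand-written example of the kind of dictionary the formatter returns
example_params = {
    'q': 'molecule_name:Acetylcholinesterase',  # the search itself
    'fl': 'pdb_id,resolution',                  # comma-separated fields to return
    'rows': 10,                                 # maximum number of rows
    'wt': 'json',                               # response format
}

encoded_body = urlencode(example_params)
print(encoded_body)
# prints: q=molecule_name%3AAcetylcholinesterase&fl=pdb_id%2Cresolution&rows=10&wt=json
```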

Finally, we can package these two functions into a single command (another function...) 
that makes querying as simple as calling `run_search` every time we need to retrieve 
data.

In [None]:
def run_search(search_terms, filter_terms=None, number_of_rows=10, **kwargs):
    """
    Run the search with set of search terms
    """

    # Pass any extra keyword arguments through so they reach the API request
    search_params = format_search_terms_post(
        search_terms=search_terms, filter_terms=filter_terms, **kwargs
    )
    
    if search_params:
        response = make_request_post(search_dict=search_params, number_of_rows=number_of_rows)

        if response:
            results = response.get('response', {}).get('docs', [])
            print(f'Number of results for {search_terms}: {len(results)}')
            return results

    print('No results')
    return []
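
The sketch below illustrates how the nested `.get()` calls pull the documents out of a 
response. The `mock_response` dictionary is a made-up placeholder that follows the 
standard Solr JSON layout; it is assumed, not confirmed, that every PDBe response 
includes all of these keys. Note that the count printed by `run_search` is the number of 
rows returned (capped by `number_of_rows`), which can be smaller than the total number 
of matches reported in `numFound`.

```python
# Mock response in the standard Solr JSON layout; all values are placeholders
mock_response = {
    "responseHeader": {"status": 0},
    "response": {
        "numFound": 25,
        "start": 0,
        "docs": [{"pdb_id": "xxx1"}, {"pdb_id": "xxx2"}],
    },
}

body = mock_response.get("response", {})
docs = body.get("docs", [])
total_matches = body.get("numFound", 0)

print(f"Returned {len(docs)} of {total_matches} matching documents")
# prints: Returned 2 of 25 matching documents

# Chained .get() calls fall back to an empty list when a key is missing
missing = {}.get("response", {}).get("docs", [])
print(missing)  # prints: []
```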

---
---

## Initial testing

Now we are ready to run a search against the PDBe API. Our overall goal is to find PDB entries containing Acetylcholinesterase from *Homo sapiens*.

A list of search terms is available at:
https://www.ebi.ac.uk/pdbe/api/doc/search.html

As a first step, we will search only for the molecule name "Acetylcholinesterase" in the PDB.

To run this search we first need to set the query parameters using `Q` (short for "query"), a class imported from `solrq`. Once these have been set, we can use the `run_search` function that we defined above to perform the API query and return the results.

By default we have limited the length of the results to 10 rows. 

In [None]:
# Create the Solr search terms object
search_terms = Q(molecule_name='Acetylcholinesterase')

# Run the search
initial_results = run_search(search_terms)
print("Finished")

Not all queries will return valuable results.

What if we try to search for something that doesn't exist?

In [None]:
# The Solr search object can be created with the incorrect search term, 'bob'
search_terms = Q(bob="Acetylcholinesterase")

# Run the erroneous search
bad_results = run_search(search_terms)
print("Finished")

In [None]:
# The search term is now correct, but the value is incorrect
search_terms = Q(molecule_name="bob")

# Run the erroneous search and see what you get
empty_results = run_search(search_terms)
print("Finished")

What if we define the search terms incorrectly? (Hint: This will fail!)

In [None]:
search_terms = Q('bob')
bad_results = run_search(search_terms)

---
---

## Refining the query

We can make the search results more specific by adding additional query parameters.

Here we will try to add organism_name "Homo sapiens" to the query to limit the results to only return those that are structures of the human Acetylcholinesterase.

In [None]:
search_terms = Q(organism_name='Homo sapiens', molecule_name='Acetylcholinesterase')
refined_results = run_search(search_terms)

How did we know which search terms to use?

There are many parameters that can be used to filter the results of a search. To find useful data requires an understanding of the data available.


Exploring the data available is an essential part of the process. All the search terms can be found here:

https://www.ebi.ac.uk/pdbe/api/doc/search.html

For more complicated queries, have a look at the `solrq` documentation:

https://solrq.readthedocs.io/en/latest/index.html

---
---


## Exploring the results

Once a set of results has been obtained, it can be explored in more detail.

We will now look at individual protein structures returned in the refined_results.

The following code returns all the data associated with the first protein structure found in the refined_results. It uses "pprint" (pretty print) to make the results easier to read.

All of the "keys" on the left side of the results can be used as a search term.

In [None]:
pprint(refined_results[0])

There are many terms with the prefixes "q_" and "t_". These are only used for internal processes in PDBe and so can be ignored. 

Below we will find all the search terms that might be useful when querying the data (excludes the "q_" and "t_" search terms).

In [None]:
useful_search_terms = []

for term in refined_results[0].keys():
    if not term.startswith('q_') and not term.startswith('t_'):
        useful_search_terms.append(term)

print(
    f"There are {len(useful_search_terms)} available search terms ",
    '(excluding "q_" and "t_" terms)'
)

We can then print out the terms we can use:

In [None]:
pprint(useful_search_terms)

As you can see we get lots of data back about the individual molecule we have searched for and the PDB entries
in which it is contained.

For example, we can get the PDB ID and structure resolution for this first result as follows:

In [None]:
print(f"PDB ID:     {refined_results[0]['pdb_id']}")
print(f"Resolution: {refined_results[0]['resolution']}")
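
Indexing with square brackets raises a `KeyError` if a field is absent from a result. 
Assuming that some results may lack a given field (an assumption, not something stated 
above), a more defensive sketch uses `.get()` instead. The `mock_docs` list below is 
made-up placeholder data, not real search output.

```python
# Placeholder documents; the second one deliberately lacks 'resolution'
mock_docs = [
    {"pdb_id": "xxx1", "resolution": 2.1},
    {"pdb_id": "xxx2"},
]

labels = []
for doc in mock_docs:
    resolution = doc.get("resolution")  # None when the key is missing
    label = str(resolution) if resolution is not None else "not available"
    labels.append(label)
    print(f"PDB ID:     {doc['pdb_id']}")
    print(f"Resolution: {label}")
```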

---
---

## Filtering the output data

There are too many different terms to look through, so we can use a filter to restrict the results to only the
information we want, making them easier to read.

In [None]:
search_terms = Q(molecule_name="Acetylcholinesterase",organism_name="Homo sapiens")
filter_terms = ['pdb_id', 'resolution']

resolution_results = run_search(search_terms, filter_terms)
pprint(resolution_results)

---
---


## Reformatting the results

While we were exploring the data we restricted the number of entries in the output to 10 rows. This allows us to get the results more quickly. Once we have refined our query parameters we can increase this limit.

Now that we have a refined query, let's increase the limit to 1000 rows. Depending on the search, we might get fewer than 1000 results back.

**--This fulfils Project Aim 1A--**

In [None]:
search_terms = Q(molecule_name="Acetylcholinesterase",organism_name="Homo sapiens")
filter_terms = ['pdb_id', 'resolution', 'release_year']

huma_ach_results = run_search(search_terms, filter_terms, number_of_rows=1000)
pprint(huma_ach_results)

We are going to use a Python package called Pandas to help us analyse and visualise the 
results. Pandas is a library for handling tabular data and also has built-in graphing 
capability (built on top of a package called [`matplotlib`](https://matplotlib.org/stable/)).
Begin by importing the Pandas library using the code below:

In [None]:
# Imports pandas and set its name to pd
import pandas as pd

The `... as pd` part in the code block above gives the library the shorter alias `pd`. 
This is a common shorthand when using Pandas but is optional. 

The below function reformats the results so that they can be loaded into Pandas:

In [None]:
def pandas_dataset(list_of_results):
    """
    Updates lists to strings for loading into Pandas
    """
    for row in list_of_results:
        for data in row:
            if isinstance(row[data], list):
                # If there are any numbers in the list change them into strings
                row[data] = [str(a) for a in row[data]]

                # Unique and sort the list and then change the list into a string
                row[data] = ','.join(sorted(list(set(row[data]))))

    df = pd.DataFrame(list_of_results)
    return df
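
The list handling inside `pandas_dataset` can be seen in isolation in the sketch below. 
`flatten_list` is a hypothetical helper that mirrors the two steps in the loop, and the 
input lists are arbitrary placeholders.

```python
def flatten_list(values):
    """Convert a list into a sorted, de-duplicated, comma-separated string."""
    # Convert every item to text, drop duplicates with a set, then sort and join
    return ','.join(sorted({str(value) for value in values}))


letters = flatten_list(['B', 'A', 'B'])
numbers = flatten_list([10, 9, 10])

print(letters)  # prints: A,B
print(numbers)  # prints: 10,9 (sorted as text, not as numbers)
```

Because the values are converted to strings before sorting, numbers are ordered 
alphabetically rather than numerically. Also note that `pandas_dataset` overwrites the 
lists in the rows it is given, so the original results are modified in place.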

In [None]:
df_human_ach = pandas_dataset(list_of_results=huma_ach_results)
df_human_ach.head()

We can save the results to a CSV file, which we can then load into Excel:

In [None]:
df_human_ach.to_csv("search_results_project_aims_1a.csv")
print('Search results written to file: search_results_project_aims_1a.csv')

---
---

## Downloading a structure file

Now you have a list of PDB IDs, you can download the structure files for these entries.

In [1]:
import urllib.request
import shutil
from contextlib import closing

pdb_id = '10mh'

url_stem = "https://ftp.ebi.ac.uk/pub/databases/pdb/data/structures/divided/mmCIF/"
url_entry = f"{pdb_id[1:3]}/{pdb_id}.cif.gz"

complete_url = url_stem + url_entry

with closing(urllib.request.urlopen(complete_url)) as r:
    file_save_name = f"{pdb_id}.cif.gz"

    with open(file_save_name, 'wb') as f:
        shutil.copyfileobj(r, f)

The example above works for a single PDB ID, but you can also download multiple files by iterating through a list of PDB IDs. If you're downloading hundreds or thousands of files, you may want to consider using the PDBe's bulk download service instead: [PDBe Bulk Download Service](https://www.ebi.ac.uk/pdbe/download/docs).
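
A minimal sketch of iterating through a list of PDB IDs is shown below. The helper names 
`build_mmcif_url` and `download_entries` are hypothetical, and lower-casing the IDs is an 
assumption based on the lower-case ID used in the example above. The download call is 
left commented out because it needs network access.

```python
import shutil
import urllib.request
from pathlib import Path

URL_STEM = "https://ftp.ebi.ac.uk/pub/databases/pdb/data/structures/divided/mmCIF/"


def build_mmcif_url(pdb_id):
    """Build the download URL for one entry, following the pattern used above."""
    pdb_id = pdb_id.lower()  # assumption: file names use lower-case IDs
    return f"{URL_STEM}{pdb_id[1:3]}/{pdb_id}.cif.gz"


def download_entries(pdb_ids, output_dir="."):
    """Download each entry in turn and return the paths of the saved files."""
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for pdb_id in pdb_ids:
        target = output_path / f"{pdb_id.lower()}.cif.gz"
        try:
            with urllib.request.urlopen(build_mmcif_url(pdb_id)) as response, \
                    open(target, "wb") as handle:
                shutil.copyfileobj(response, handle)
            saved.append(target)
        except OSError as error:  # URLError and HTTPError are subclasses of OSError
            print(f"Could not download {pdb_id}: {error}")
    return saved


example_url = build_mmcif_url("10mh")
print(example_url)
# prints: https://ftp.ebi.ac.uk/pub/databases/pdb/data/structures/divided/mmCIF/0m/10mh.cif.gz

# Uncomment to download (requires network access):
# download_entries(["10mh"], output_dir="structures")
```

Catching the error inside the loop means one failed download does not stop the rest of 
the list from being processed.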

---
---

## Analysing and plotting the results

We can use the above results to count how many PDB IDs there are for each resolution.
The code below groups the PDB IDs by resolution value and then counts the number of unique PDB IDs per resolution value.

In [None]:
df_human_ach.groupby('resolution')['pdb_id'].nunique()
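
To see what this grouping does without running a search, the sketch below applies the 
same `groupby(...).nunique()` call to a small mock table. The IDs and resolution values 
are made-up placeholders, with one ID deliberately repeated.

```python
import pandas as pd

# Placeholder rows; 'xxx3' appears twice to show that duplicates are counted once
mock_df = pd.DataFrame({
    "pdb_id": ["xxx1", "xxx2", "xxx3", "xxx3"],
    "resolution": [2.0, 2.0, 2.5, 2.5],
})

# Group rows by resolution, then count the distinct PDB IDs in each group
counts = mock_df.groupby("resolution")["pdb_id"].nunique()
print(counts)
```

Here the group for resolution 2.0 contains two distinct IDs, while the group for 2.5 
contains one, because the repeated ID is only counted once.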

We can then plot these results in a variety of ways using pandas:

In [None]:
def pandas_plot(df, column_to_group_by, graph_type='bar'):
    """
    Function to create a plot from a Pandas DataFrame
    """
    ds = df.groupby(column_to_group_by)['pdb_id'].nunique()
    ds.plot(kind=graph_type)

In [None]:
# Plot resolution as a histogram
df_human_ach['resolution'].plot(kind='hist')

In [None]:
# Plot release year as a bar chart
pandas_plot(
    df=df_human_ach,
    column_to_group_by='release_year',
    graph_type='bar'
)

A line plot might make more sense for this data:

In [None]:
# Plot release year as a line chart
pandas_plot(
    df=df_human_ach,
    column_to_group_by='release_year',
    graph_type='line'
)

---
---

## Searching for interacting macromolecules

We can now use what we have learnt to obtain the data for the other project aims:

**--The following search fulfils Project Aim 1B--**

In [None]:
# Obtain Project Aim 1B results
search_terms = Q(molecule_name="Acetylcholinesterase",organism_name="Homo sapiens")
filter_terms = ['pdb_id','interacting_uniprot_accession']
human_ach_interacting_prots_results = run_search(
    search_terms, 
    filter_terms, 
    number_of_rows = 1000
)

In [None]:
# Reformat and write results to file
df_human_ach_interacting_prots = pandas_dataset(
    list_of_results = human_ach_interacting_prots_results
)
df_human_ach_interacting_prots.to_csv("search_results_project_aims_1b.csv")

print('Search results written to file: search_results_project_aims_1b.csv')

In [None]:
# Print results
pprint(human_ach_interacting_prots_results)

**--The following search fulfils Project Aim 1C--**

In [None]:
# Obtain Project Aim 1C results
search_terms = Q(molecule_name="Acetylcholinesterase",organism_name="Homo sapiens")
filter_terms = ['pdb_id','interacting_ligands']
human_ach_interacting_ligs_results = run_search(
    search_terms, 
    filter_terms, 
    number_of_rows = 1000
)

In [None]:
# Reformat and write results to file
df_human_ach_interacting_ligs = pandas_dataset(
    list_of_results = human_ach_interacting_ligs_results
)
df_human_ach_interacting_ligs.to_csv("search_results_project_aims_1c.csv")

print('Search results written to file: search_results_project_aims_1c.csv')

In [None]:
# Print results
pprint(human_ach_interacting_ligs_results)

---
---

### Optional extras

Some data is only available through the search API and not the web interface.

One example of this is the additional information made available about antibodies:

In [None]:
search_terms = Q(antibody_flag='Y')
filter_terms = ['antibody_name', 'antibody_species', 'pdb_id']

antibody_results = run_search(
    search_terms, 
    filter_terms=filter_terms, 
    number_of_rows=1000000
)

We can explore this data by grouping the column values:

In [None]:
df_antibody_results = pandas_dataset(antibody_results)
print(df_antibody_results)

# Count number of entries containing an antibody
ds_antibody_entries = df_antibody_results.groupby('pdb_id').count()
print(
    f"""
    Number of antibody entries: {len(ds_antibody_entries)}
    """
)

# Count all the species which an antibody has been obtained from 
ds_antibody_species = df_antibody_results.groupby('antibody_species').count()
print(
    f"""
    Antibody entries broken down by species: 
    {ds_antibody_species['antibody_name']}
    """
)