# PDBe API Training

This interactive Python notebook will guide you through various ways of programmatically accessing Protein Data Bank in Europe (PDBe) data using the REST API.

The REST API is a programmatic way to obtain information from the PDB and EMDB. You can access details about:

* sample
* experiment
* models
* compounds
* cross-references
* publications
* quality
* assemblies

and more.

For more information, visit https://www.ebi.ac.uk/pdbe/pdbe-rest-api

# Notebook #6
This notebook is the second in the training material series, and focuses on getting information for multiple PDB entries using the REST search API of PDBe.

## 1) Making imports and setting variables
First, we import some packages that we will use, and set some variables.

Note: The full list of valid URLs is available from https://www.ebi.ac.uk/pdbe/api/doc/

In [16]:
import requests
from pprint import pprint

base_url = "https://www.ebi.ac.uk/pdbe/"

api_base = base_url + "api/"

search_url = base_url + 'search/pdb/select?'

## 2) Defining request function
Let's start by defining a function that can be used to GET data from the PDBe search API.


In [11]:
def make_request(search_term, number_of_rows=10):
    """
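    Make a GET request to the PDBe search API.

    :param search_term: String, the formatted query, e.g. 'q=...'
    :param number_of_rows: Integer, maximum number of rows to return
    :return: parsed JSON as a dict, or an empty dict if the request fails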
    """
    search_variables = '&wt=json&rows={}'.format(number_of_rows)
    url = search_url+search_term+search_variables
    print(url)
    response = requests.get(url)

    if response.status_code == 200:
        return response.json()
    else:
        print("[No data retrieved - %s] %s" % (response.status_code, response.text))
    
    return {}
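
The URL in `make_request` is assembled by plain string concatenation, so any spaces or quotes in the search term end up in it unescaped. As a minimal sketch of making the escaping explicit, the standard library's `urllib.parse.urlencode` can build the query string. The helper name `build_search_url` is hypothetical and not part of the notebook or of any PDBe client; the endpoint and the `q`, `wt` and `rows` parameters are the ones already used above.

```python
from urllib.parse import urlencode

# Same search endpoint as defined in section 1 of this notebook.
SEARCH_URL = "https://www.ebi.ac.uk/pdbe/search/pdb/select?"


def build_search_url(query, number_of_rows=10):
    """Hypothetical helper: return a percent-encoded search URL."""
    # urlencode escapes spaces, quotes and colons in each value.
    params = {"q": query, "wt": "json", "rows": number_of_rows}
    return SEARCH_URL + urlencode(params)


example_url = build_search_url('molecule_name:"Dihydrofolate reductase"')
print(example_url)
```

This only builds the URL and makes no network request, so it can be used to inspect what would be sent before calling the API.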

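## 3) Formatting the search terms
This function builds the query string for the search API. A dictionary of terms is joined with " AND ", values containing spaces are wrapped in quotes, and any "&" in a plain string is replaced with " AND ".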

In [18]:
def format_search_terms(search_terms):
    print('formatting search terms: %s' % search_terms)
    search_string = ''
    search_list = []
    if isinstance(search_terms, dict):
        for key in search_terms:
            term = search_terms.get(key)
            if ' ' in term:
                if '"' not in term:
                    term = '"{}"'.format(term)
                elif "'" not in term:
                    term = "'{}'".format(term)
            search_list.append('{}:{}'.format(key, term))
        search_string = ' AND '.join(search_list)
    else:
        if '&' in search_terms:
            search_string = search_terms.replace('&', ' AND ')
        else:
            search_string = search_terms
    print('formatted search terms: %s' % search_string)
    return 'q={}'.format(search_string)

## 4) Defining the search function

This function runs the search and returns a list of results.

In [7]:
def run_search(search_terms, number_of_rows=10):
    search_term = format_search_terms(search_terms)

    response = make_request(search_term, number_of_rows)
    # default to an empty list so the return type is always a list
    return response.get('response', {}).get('docs', [])
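
To make the nested lookup in `run_search` easier to follow, here is a small sketch that applies the same `response` → `docs` lookup to a hand-written payload. The payload is illustrative only and is not real API output; `extract_docs` is a hypothetical name used just for this example.

```python
def extract_docs(payload):
    """Hypothetical helper: pull the list of result rows out of a search payload."""
    # Each .get() supplies a default, so a failed request ({}) yields an empty list.
    return payload.get("response", {}).get("docs", [])


# Hand-written payload mimicking the nesting that run_search expects.
sample_payload = {"response": {"docs": [{"pdb_id": "5hve"}, {"pdb_id": "1yho"}]}}

docs = extract_docs(sample_payload)
no_docs = extract_docs({})
print(len(docs), no_docs)
```

Chaining `.get()` with defaults means callers can loop over the result without first checking whether the request succeeded.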
    

## 5) Running a search

Now we are ready to run a search for entries containing human Dihydrofolate reductase in the PDB. The search returns a list of results, limited to 10 rows to start with.

A list of search terms is available at:
https://www.ebi.ac.uk/pdbe/api/doc/search

This will return details of each human Dihydrofolate reductase in the PDB

In [19]:
search_terms = {"molecule_name":"Dihydrofolate reductase",
                "organism_scientific_name":"Homo sapiens"
               } 
# can we search for human?
# can we do better than raw text here? what modules are available for searching for us?
results = run_search(search_terms)
print('Number of results: {}'.format(len(results)))


formatting search terms: {'molecule_name': 'Dihydrofolate reductase', 'organism_scientific_name': 'Homo sapiens'}
formatted search terms: molecule_name:"Dihydrofolate reductase" AND organism_scientific_name:"Homo sapiens"
https://www.ebi.ac.uk/pdbe/search/pdb/select?q=molecule_name:"Dihydrofolate reductase" AND organism_scientific_name:"Homo sapiens"&wt=json&rows=10
Number of results: 10


We will then look at the first result.
We are using "pprint.pprint" rather than "print" to make the result easier to read.

In [21]:
pprint(results[0])

{'_version_': 1638982063406186496,
 'all_assembly_composition': ['protein structure'],
 'all_assembly_form': ['homo'],
 'all_assembly_id': ['1'],
 'all_assembly_mol_wt': [22.485],
 'all_assembly_type': ['monomer'],
 'all_authors': ['Cody V', 'Gangjee A'],
 'all_compound_names': ['NDP : NADPH DIHYDRO-NICOTINAMIDE-ADENINE-DINUCLEOTIDE '
                        'PHOSPHATE',
                        '65Q : '
                        '5-methyl-6-{[3-(trifluoromethoxy)phenyl]sulfanyl}thieno[2,3-d]pyrimidine-2,4-diamine',
                        '65Q : '
                        '5-methyl-6-{[3-(trifluoromethoxy)phenyl]sulfanyl}thieno[2,3-d]pyrimidine-2,4-diamine',
                        '65Q : '
                        '5-methyl-6-[3-(trifluoromethyloxy)phenyl]sulfanyl-thieno[2,3-d]pyrimidine-2,4-diamine',
                        'NDP : '
                        '[[(2R,3S,4R,5R)-5-(3-aminocarbonyl-4H-pyridin-1-yl)-3,4-dihydroxy-oxolan-2-yl]methoxy-hydroxy-phosphoryl] '
                        

As you can see, we get lots of data back about the individual molecule we have searched for and the PDB entries in which it is contained.

If we wanted to know the experimental methods used to determine structures of human Dihydrofolate reductase we could loop through the results and count how many entries use each experimental method. 

Be aware that the results from the search are per molecule. So each PDB entry will appear multiple times - once for each molecule.
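
The effect of per-molecule results can be seen on a few hand-written rows. The sketch below keeps only the first row for each PDB ID. The rows are illustrative, not real search output, and `unique_entries` is a hypothetical helper name.

```python
# Hand-written rows: the entry 5hve appears twice, once per molecule.
sample_results = [
    {"pdb_id": "5hve", "experimental_method": ["X-ray diffraction"]},
    {"pdb_id": "5hve", "experimental_method": ["X-ray diffraction"]},
    {"pdb_id": "1yho", "experimental_method": ["Solution NMR"]},
]


def unique_entries(rows):
    """Keep the first row seen for each PDB ID, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        pdb_id = row.get("pdb_id")
        if pdb_id in seen:
            continue  # this entry was already counted
        seen.add(pdb_id)
        unique.append(row)
    return unique


unique_rows = unique_entries(sample_results)
print([row["pdb_id"] for row in unique_rows])
```

A set is used for the membership check because lookups stay fast as the number of rows grows.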

In [25]:
def result_counter(term_to_search_for):

    pdb_list = [] # we will use this to store the PDB IDs we have already seen so we don't double count
    ret = {} # the actual results

    # we will loop through the results and group the PDB IDs by the value of the requested field
    for result in results:
        data = result.get(term_to_search_for, '')
        if isinstance(data, list):
            data = ','.join(sorted(data))
        pdb_id = result.get('pdb_id')
        if pdb_id in pdb_list:
            continue # we have already seen this PDB code, move on to the next result
        ret.setdefault(data, []).append(pdb_id) # add pdb_id to a list for each experimental method
        pdb_list.append(pdb_id)
    return ret

exp_method_result = result_counter('experimental_method')
pprint(exp_method_result) 

{'Solution NMR': ['1yho'],
 'X-ray diffraction': ['5hve',
                       '3f8y',
                       '3nxt',
                       '4m6k',
                       '4m6l',
                       '1s3u',
                       '4g95',
                       '4ddr',
                       '1hfr']}


This isn't the nicest output, so we can format it to show how many PDB entries there are for each experimental method.

In [26]:
for row in exp_method_result:
    print('{}: {} entries'.format(row, len(exp_method_result.get(row))))

X-ray diffraction: 9 entries
Solution NMR: 1 entries
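
As one possible refinement, the counts can be collected into a dictionary and printed from most to least common, with the singular "entry" used for a count of one. This is only a sketch: the dictionary below is copied from the output shown earlier so that the block runs on its own.

```python
# Copied from the result_counter output above so this block is self-contained.
exp_method_result = {
    "Solution NMR": ["1yho"],
    "X-ray diffraction": ["5hve", "3f8y", "3nxt", "4m6k", "4m6l",
                          "1s3u", "4g95", "4ddr", "1hfr"],
}

# Number of PDB entries per experimental method.
counts = {method: len(ids) for method, ids in exp_method_result.items()}

# Print the most common method first.
for method, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    label = "entry" if count == 1 else "entries"
    print("{}: {} {}".format(method, count, label))
```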
