# PDBe API endpoints to retrieve text-mined residue-level annotations

This Jupyter Notebook gives a few examples of how to retrieve residue-level annotations text-mined from IUCr publications through PDBe's API (application programming interface) endpoints. We give a short description of the endpoints and the results they return, and provide some very basic code examples as a starting point for one's own developments.


The documentation and a sandpit for exploring the endpoints can be found here:
https://www.ebi.ac.uk/pdbe/api/v2


We provide endpoints that return annotations grouped either by PDB entry or by UniProt accession. For a PDB entry, a separate request can narrow the results to a PDB chain identifier and a PDB residue number. For a UniProt accession, a separate request can select the annotations for a specific UniProt residue. PDB and UniProt numbering for a protein can differ, so the user needs to pay attention to which one applies.

It is also important to note that the annotations returned by the API are aggregated across publications. So, unlike the annotations displayed alongside an IUCr article or on a particular PDB entry, the API can return annotations from multiple scientific articles concerned with the same PDB entry or UniProt accession. There are four different endpoints through which the residue-level annotations can be accessed:
* for a PDB entry
* for a specific residue in a selected chain of a PDB entry
* for a UniProt accession (this includes all PDB entries that refer to a particular UniProt accession)
* for a UniProt accession and a selected residue (this includes all PDB entries that refer to a particular UniProt accession)
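
The four endpoints differ only in the path appended to two base URLs. As a minimal sketch, the URLs could be assembled as shown here; the helper names `build_pdb_url` and `build_uniprot_url` are hypothetical and not part of the PDBe API, and the base URLs are the ones used by the functions in this notebook.

```python
from typing import Optional

# Base URLs as used by the request functions in this notebook
PDB_BASE = "https://www.ebi.ac.uk/pdbe/api/v2/pdb/entry/llm_annotations/summary/"
UNIPROT_BASE = "https://www.ebi.ac.uk/pdbe/api/v2/uniprot/llm_annotations/summary/"


def build_pdb_url(pdb_entry: str,
                  pdb_chain: Optional[str] = None,
                  pdb_residue: Optional[int] = None) -> str:
  """Hypothetical helper: URL for a PDB entry, optionally one chain/residue."""
  # Chain and residue only make sense as a pair
  if (pdb_chain is None) != (pdb_residue is None):
    raise ValueError("pdb_chain and pdb_residue must be given together")
  url = PDB_BASE + pdb_entry
  if pdb_chain is not None:
    url += f"/{pdb_chain}/{pdb_residue}"
  return url


def build_uniprot_url(uniprot_accession: str,
                      uniprot_residue: Optional[int] = None) -> str:
  """Hypothetical helper: URL for a UniProt accession, optionally one residue."""
  url = UNIPROT_BASE + uniprot_accession
  if uniprot_residue is not None:
    url += f"/{uniprot_residue}"
  return url


# One URL per endpoint, using the example residue from this notebook
for url in (build_pdb_url("5cxt"),
            build_pdb_url("5cxt", "A", 79),       # PDB numbering
            build_uniprot_url("Q8VH51"),
            build_uniprot_url("Q8VH51", 495)):    # UniProt numbering
  print(url)
```

Keeping the two numbering schemes in separately named parameters makes it harder to pass a UniProt position to a PDB endpoint by mistake.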


Below, we first import the necessary Python libraries for making a request to the endpoints. We then define four different functions, each contacting one of the four API endpoints to retrieve the annotations. Finally, we execute the function calls to retrieve details for one of the residues we encountered in the video tutorials, Trp495 in chain A of PDB entry 5CXT and UniProt accession Q8VH51. In PDB numbering this residue is at sequence position 79.

All API endpoints return a JSON (JavaScript Object Notation) dictionary. The dictionary contains details about the annotation provider, here "IUCr", and then a list of residues that were annotated. For each residue we give the "startIndex" and "endIndex" for either a PDB entry or a UniProt accession, and we have "additionalData". The "additionalData" is a list of all annotations found for a particular PDB entry or UniProt accession and, if applicable, for a specified residue.

Each entry in the annotation list gives the available identifiers for a publication, such as PubMed ID ("pubmedId"), PubMed Central ID ("pmcId") and a digital object identifier ("doi"). We provide details about the citation status, so the user knows whether an annotation originated from a primary or additional publication for a PDB entry ("primaryCitation": "Y" or "N"), and we also indicate whether the scientific article was open access or not ("openAccess": "Y" or "N").

An annotation contains details about the PDB entry it was linked to by providing an "entityId" for a molecular entity in an entry, the "pdbChain" of the entity and the "pdbResidue". We also add an "authorResidueNumber", which may be different from the "pdbResidue". To support disambiguation we add the UniProt accession ("uniprotAccession") and the reference UniProt residue ("uniprotResidue"). The latter may be different from the "pdbResidue", the "authorResidueNumber", or both. All residues are referred to by their sequence position, excluding the amino acid name.

For the text mining details, the sentence where the annotation was found is returned ("sentence") along with the section ("section") it was located in. "exact" gives the exact text span for the annotation, "entityType" is the assigned class label for an entity, "annotator" gives the annotating model and version, and "aiScore" refers to the confidence score of the model for this annotation. Below is an example dictionary of the returned result.

{
  "dataType": "ANNOTATIONS",
  "data": [
    {
      "provider": "IUCr",
      "residueList": [
        {
          "startIndex": 42,
          "endIndex": 42,
          "indexType": "PDB",
          "additionalData": [
            {
              "pubmedId": 1234567,
              "pmcId": "PMC1234567",
              "doi": "10.1234/abcd.efgh",
              "primaryCitation": "Y",
              "openAccess": "Y",
              "entityId": 1,
              "pdbResidue": 42,
              "authorResidueNumber": 42,
              "pdbChain": "A",
              "uniprotAccession": "P12345",
              "uniprotResidue": 42,
              "sentence": "This is a sample sentence.",
              "section": "Title",
              "exact": "Ala33",
              "entityType": "protein",
              "annotator": "autoannotator_v2.1",
              "aiScore": 0.95
            }
          ]
        }
      ]
    }
  ]
}
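
Because the annotations sit three levels deep ("data", then "residueList", then "additionalData"), it can help to flatten them before further processing. The following is a minimal sketch that assumes the layout of the example dictionary above; the function name `iter_annotations` is hypothetical, and the sample is a trimmed copy of the example.

```python
from typing import Any, Dict, Iterator


def iter_annotations(payload: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
  """Yield every annotation dictionary found in a PDB-style payload."""
  for provider_block in payload.get("data", []):
    for residue in provider_block.get("residueList", []):
      for annotation in residue.get("additionalData", []):
        yield annotation


# Trimmed copy of the example dictionary shown above
example = {
  "dataType": "ANNOTATIONS",
  "data": [{
    "provider": "IUCr",
    "residueList": [{
      "startIndex": 42,
      "endIndex": 42,
      "indexType": "PDB",
      "additionalData": [{
        "pdbChain": "A",
        "pdbResidue": 42,
        "exact": "Ala33",
        "aiScore": 0.95,
      }],
    }],
  }],
}

rows = [(a["pdbChain"], a["pdbResidue"], a["exact"], a["aiScore"])
        for a in iter_annotations(example)]
print(rows)  # [('A', 42, 'Ala33', 0.95)]
```

Note that the live responses shown later in this notebook wrap this payload in an outer dictionary keyed by the PDB identifier, so the inner value would need to be passed to such a helper.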


### Setting up the environment to make API calls
Importing the necessary packages and libraries.

In [31]:
import requests
import json
from typing import Dict

### API call for a PDB entry
To retrieve the residue-level annotations for a particular PDB entry, we construct the URL for our API endpoint to request the details from PDBe's database as part of a predefined query. The only information we need to provide is the identifier for the PDB entry (pdb_entry) we are interested in.

In [32]:
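def get_text_mined_annotations_for_pdb_entry(pdb_entry: str) -> Dict[str, dict]:
  """Return all text-mined annotations for a PDB entry, keyed by its PDB ID."""
  base = "https://www.ebi.ac.uk/pdbe/api/v2/pdb/entry/llm_annotations/summary/"
  full = base + pdb_entry
  # Send the GET request and decode the JSON body into a dictionary
  result = requests.get(full).json()

  return result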

### API call for a specific residue in a selected chain of a PDB entry

To retrieve the residue-level annotations for a specific residue of a selected chain of a particular PDB entry, we construct the URL for our API endpoint to request the details from PDBe's database as part of a predefined query. The required information is the identifier for the PDB entry (pdb_entry) we are interested in along with the selected chain (pdb_chain) and a specific residue (pdb_residue). Note that the residue is given by its sequence position, i.e. an integer, which is passed here as a string so that it can be appended to the URL.

In [33]:
def get_text_mined_annotations_for_pdb_entry_chain_and_residue(pdb_entry: str,
                                                               pdb_chain: str,
                                                               pdb_residue: str
                                                               ) -> Dict[str, dict]:
  """Return annotations for one residue (PDB numbering) in one chain of a PDB entry."""
  base = "https://www.ebi.ac.uk/pdbe/api/v2/pdb/entry/llm_annotations/summary/"
  full = base + pdb_entry + "/" + pdb_chain + "/" + pdb_residue
  # Show the URL that is being requested
  print(full)
  result = requests.get(full).json()

  return result

### API call for a UniProt accession

To retrieve the residue-level annotations for a particular UniProt accession, we construct the URL for our API endpoint to request the details from PDBe's database as part of a predefined query. The only information we need to provide is the UniProt accession (uniprot_accession) we are interested in.

In [34]:
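def get_text_mined_annotations_for_uniprot_accession(uniprot_accession: str
                                                     ) -> Dict[str, dict]:
  """Return annotations for all PDB entries that refer to a UniProt accession."""
  base = "https://www.ebi.ac.uk/pdbe/api/v2/uniprot/llm_annotations/summary/"
  full = base + uniprot_accession
  # Send the GET request and decode the JSON body into a dictionary
  result = requests.get(full).json()

  return result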





### API call for a specific residue in a UniProt accession

To retrieve the residue-level annotations for a specific residue in a particular UniProt accession, we construct the URL for our API endpoint to request the details from PDBe's database as part of a predefined query. The required information is the identifier for the UniProt accession (uniprot_accession) we are interested in and a specific residue (uniprot_residue). Note that the residue is given by its sequence position, i.e. an integer, which is passed here as a string so that it can be appended to the URL.

In [35]:
def get_text_mined_annotations_for_uniprot_accession_and_residue(
    uniprot_accession: str,
    uniprot_residue: str
    ) -> Dict[str, dict]:
  """Return annotations for one residue (UniProt numbering) of a UniProt accession."""
  base = "https://www.ebi.ac.uk/pdbe/api/v2/uniprot/llm_annotations/summary/"
  full = base + uniprot_accession + "/" + uniprot_residue
  # Show the URL that is being requested
  print(full)
  result = requests.get(full).json()

  return result

### Executing the above requests

Below are the function calls for the requests we defined above.

1. Getting annotations for a particular PDB entry

In [36]:
get_text_mined_annotations_for_pdb_entry("5cxt")

{'5cxt': {'dataType': 'ANNOTATIONS',
  'data': [{'provider': 'IUCr',
    'residueList': [{'startIndex': 79,
      'endIndex': 79,
      'indexType': 'PDB',
      'additionalData': [{'pubmedId': 27050129,
        'pmcId': 'PMC4822562',
        'doi': '10.1107/S2059798316001248',
        'primaryCitation': 'Y',
        'openAccess': 'N',
        'entityId': 1,
        'pdbResidue': 79,
        'authorResidueNumber': 495,
        'pdbChain': 'C',
        'uniprotAccession': 'Q8VH51',
        'uniprotResidue': 495,
        'sentence': 'The reciprocal tryptophan binding sites (Trp495 and Trp92) and the nearby residues proline (Pro95) and phenylalanine (Phe496) are shown as stick diagrams.',
        'section': 'display-objects',
        'exact': 'Trp495',
        'entityType': 'residue_name_number',
        'annotator': 'autoannotator_v2.1_quant',
        'aiScore': 0.9994893},
       {'pubmedId': 27050129,
        'pmcId': 'PMC4822562',
        'doi': '10.1107/S2059798316001248',
        'p
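
A response for a whole entry can contain many annotations, so one may want to keep only those above a confidence threshold or only those from the primary citation. The following is a minimal sketch of such a filter; the function name `filter_annotations` and the default threshold of 0.9 are assumptions for illustration, and the second and third sample records use invented scores and citation flags.

```python
from typing import Any, Dict, List


def filter_annotations(annotations: List[Dict[str, Any]],
                       min_score: float = 0.9,
                       primary_only: bool = False) -> List[Dict[str, Any]]:
  """Keep annotations whose "aiScore" is at least min_score.

  With primary_only=True, also require "primaryCitation" to be "Y".
  """
  kept = []
  for annotation in annotations:
    if annotation.get("aiScore", 0.0) < min_score:
      continue
    if primary_only and annotation.get("primaryCitation") != "Y":
      continue
    kept.append(annotation)
  return kept


sample = [
  # Values taken from the output above
  {"exact": "Trp495", "aiScore": 0.9994893, "primaryCitation": "Y"},
  # Invented records, for illustration only
  {"exact": "Pro95", "aiScore": 0.42, "primaryCitation": "Y"},
  {"exact": "Phe496", "aiScore": 0.97, "primaryCitation": "N"},
]

print([a["exact"] for a in filter_annotations(sample)])
print([a["exact"] for a in filter_annotations(sample, primary_only=True)])
```

A suitable threshold depends on the use case, so the default here should be treated as a placeholder.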

2. Getting annotations for a specific residue of a selected chain in a particular PDB entry

NOTE: For this request the residue needs to be numbered according to the PDB structure.

In [37]:
get_text_mined_annotations_for_pdb_entry_chain_and_residue("5cxt", "A", "79")

https://www.ebi.ac.uk/pdbe/api/v2/pdb/entry/llm_annotations/summary/5cxt/A/79


{'5cxt': {'chainId': 'A',
  'residueId': 79,
  'dataType': 'ANNOTATIONS',
  'data': [{'provider': 'IUCr',
    'residueList': [{'startIndex': 79,
      'endIndex': 79,
      'indexType': 'PDB',
      'additionalData': [{'pubmedId': 27050129,
        'pmcId': 'PMC4822562',
        'doi': '10.1107/S2059798316001248',
        'primaryCitation': 'Y',
        'openAccess': 'N',
        'entityId': 1,
        'pdbResidue': 79,
        'authorResidueNumber': 495,
        'pdbChain': 'A',
        'uniprotAccession': 'Q8VH51',
        'uniprotResidue': 495,
        'sentence': 'Trp495 of RBM39-UHM is engaged in hydrophobic interactions with the ULM C-terminal prolines Pro95 and Pro96.',
        'section': 'display-objects',
        'exact': 'Trp495',
        'entityType': 'residue_name_number',
        'annotator': 'autoannotator_v2.1_quant',
        'aiScore': 0.9994568},
       {'pubmedId': 27050129,
        'pmcId': 'PMC4822562',
        'doi': '10.1107/S2059798316001248',
        'primaryCit

3. Getting annotations for a particular UniProt accession

In [38]:
get_text_mined_annotations_for_uniprot_accession("Q8VH51")

{'Q8VH51': {'dataType': 'ANNOTATIONS',
  'data': [{'pdbId': '5cxt',
    'providerList': [{'provider': 'IUCr',
      'residueList': [{'startIndex': 79,
        'endIndex': 79,
        'indexType': 'PDB',
        'additionalData': [{'pubmedId': 27050129,
          'pmcId': 'PMC4822562',
          'doi': '10.1107/S2059798316001248',
          'primaryCitation': 'Y',
          'openAccess': 'N',
          'entityId': 1,
          'pdbResidue': 79,
          'authorResidueNumber': 495,
          'pdbChain': 'C',
          'uniprotAccession': 'Q8VH51',
          'uniprotResidue': 495,
          'sentence': 'The reciprocal tryptophan binding sites (Trp495 and Trp92) and the nearby residues proline (Pro95) and phenylalanine (Phe496) are shown as stick diagrams.',
          'section': 'display-objects',
          'exact': 'Trp495',
          'entityType': 'residue_name_number',
          'annotator': 'autoannotator_v2.1_quant',
          'aiScore': 0.9994893},
         {'pubmedId': 27050129,
  

4. Getting annotations for a specific residue in a particular UniProt accession

NOTE: For this request the residue needs to be numbered according to the UniProt accession.

In [39]:
get_text_mined_annotations_for_uniprot_accession_and_residue("Q8VH51", "495")

https://www.ebi.ac.uk/pdbe/api/v2/uniprot/llm_annotations/summary/Q8VH51/495


{'Q8VH51': {'uniprotResidue': '495',
  'dataType': 'ANNOTATIONS',
  'data': [{'pdbId': '5cxt',
    'providerList': [{'provider': 'IUCr',
      'residueList': [{'startIndex': 79,
        'endIndex': 79,
        'indexType': 'PDB',
        'additionalData': [{'pubmedId': 27050129,
          'pmcId': 'PMC4822562',
          'doi': '10.1107/S2059798316001248',
          'primaryCitation': 'Y',
          'openAccess': 'N',
          'entityId': 1,
          'pdbResidue': 79,
          'authorResidueNumber': 495,
          'pdbChain': 'C',
          'uniprotAccession': 'Q8VH51',
          'uniprotResidue': 495,
          'sentence': 'The reciprocal tryptophan binding sites (Trp495 and Trp92) and the nearby residues proline (Pro95) and phenylalanine (Phe496) are shown as stick diagrams.',
          'section': 'display-objects',
          'exact': 'Trp495',
          'entityType': 'residue_name_number',
          'annotator': 'autoannotator_v2.1_quant',
          'aiScore': 0.9994893},
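
The UniProt responses shown above are nested one level deeper than the PDB responses: each item in "data" carries a "pdbId" and a "providerList", which in turn holds the "residueList". As a minimal sketch, assuming exactly that layout, the annotations can be walked and the PDB and UniProt numbering collected side by side. The function name `iter_uniprot_annotations` is hypothetical, and the sample is a trimmed copy of the output above.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_uniprot_annotations(response: Dict[str, Any]
                             ) -> Iterator[Tuple[str, Dict[str, Any]]]:
  """Yield (pdb_id, annotation) pairs from a UniProt-style response."""
  for accession_block in response.values():
    for entry_block in accession_block.get("data", []):
      pdb_id = entry_block.get("pdbId")
      for provider_block in entry_block.get("providerList", []):
        for residue in provider_block.get("residueList", []):
          for annotation in residue.get("additionalData", []):
            yield pdb_id, annotation


# Trimmed copy of the response shown above
sample_response = {
  "Q8VH51": {
    "dataType": "ANNOTATIONS",
    "data": [{
      "pdbId": "5cxt",
      "providerList": [{
        "provider": "IUCr",
        "residueList": [{
          "startIndex": 79,
          "endIndex": 79,
          "indexType": "PDB",
          "additionalData": [{
            "pdbChain": "C",
            "pdbResidue": 79,
            "authorResidueNumber": 495,
            "uniprotResidue": 495,
            "exact": "Trp495",
          }],
        }],
      }],
    }],
  },
}

# Unique (PDB entry, chain, PDB residue, UniProt residue) combinations
numbering = sorted({(pdb_id, a["pdbChain"], a["pdbResidue"], a["uniprotResidue"])
                    for pdb_id, a in iter_uniprot_annotations(sample_response)})
print(numbering)  # [('5cxt', 'C', 79, 495)]
```

Listing the two numbering schemes together makes it easy to see that UniProt residue 495 corresponds to PDB residue 79 in this entry.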
       