# Cofactors via PDB API

This repository contains a Jupyter Notebook to query and analyze cofactor information from the Protein Data Bank (PDB) API.

**Quick start**  
1. Install dependencies: `pip install -r requirements.txt`  
2. Open the notebook: `jupyter lab` or `jupyter notebook`  
3. Run cells top-to-bottom.

**What this notebook does**  
- Calls the PDB REST/JSON API to retrieve structures and annotations  
- Parses ligand/cofactor information  
- Aggregates results into tidy tables for downstream analysis  
- Saves clean CSV outputs

**Reproducibility**  
- All parameters are at the top in a single configuration cell.  
- All file I/O is confined to the project-relative `data/` and `outputs/` directories.  
- Randomness is controlled (where applicable) by fixed seeds.

> Last updated: 2025-10-18

> **Note on provenance**  
> This notebook is **based on PDBe API notebooks** and **reuses some helper functions** (adapted here) for PDBe endpoints and data normalization.  
> See PDBe resources: https://www.ebi.ac.uk/pdbe/

In [5]:
from pathlib import Path
import os

PROJECT_ROOT = Path(".").resolve()
DATA_DIR = PROJECT_ROOT / "data"
OUTPUT_DIR = PROJECT_ROOT / "outputs"
DATA_DIR.mkdir(exist_ok=True)
OUTPUT_DIR.mkdir(exist_ok=True)

# Network + API
PDB_API_BASE = "https://data.rcsb.org/rest/v1"  # adjust if you use a different endpoint
REQUEST_TIMEOUT = 30  # seconds
RETRY_ATTEMPTS = 3

# Random seed for reproducibility (if applicable)
SEED = 42

In [6]:
# Utilities
import time
from typing import Any, Dict, Optional, List

try:
    import requests
except ImportError:
    raise ImportError("Please install 'requests' (pip install requests)")

def http_get(url: str, params: Optional[Dict[str, Any]] = None, timeout: int = REQUEST_TIMEOUT) -> requests.Response:
    """GET with basic retry."""
    last_err = None
    for attempt in range(1, RETRY_ATTEMPTS + 1):
        try:
            resp = requests.get(url, params=params, timeout=timeout)
            resp.raise_for_status()
            return resp
        except Exception as e:
            last_err = e
            if attempt < RETRY_ATTEMPTS:
                time.sleep(min(2**attempt, 10))
    raise last_err

def save_csv(df, path: Path):
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path
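
To make the retry behaviour of `http_get` concrete, the sketch below reproduces only its wait schedule, without any network access. `backoff_schedule` is a hypothetical helper introduced here for illustration; it is not part of the notebook.

```python
from typing import List

def backoff_schedule(attempts: int, cap: int = 10) -> List[int]:
    """Seconds slept after each failed attempt, mirroring http_get's formula."""
    # http_get sleeps min(2**attempt, cap) after every failure except the last one
    return [min(2 ** attempt, cap) for attempt in range(1, attempts)]

print(backoff_schedule(3))  # [2, 4]
```

With the default `RETRY_ATTEMPTS = 3`, a request that keeps failing therefore waits 2 s and then 4 s before the last error is re-raised.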

# Finding enzyme cofactors in PDB


## i) Define the list of cofactors

* There are 27 classes of cofactors (Mukhopadhyay et al., 2019. doi: 10.1093/bioinformatics/btz115).


* Documentation of the PDBe compounds API: https://www.ebi.ac.uk/pdbe/api/doc/compounds.html
* Endpoint listing the cofactor classes: https://www.ebi.ac.uk/pdbe/api/pdb/compound/cofactors


## ii) Find PDB entries containing a specific ligand (cofactor)


## 1) Making imports and setting variables (1_basics.ipynb)

First, we import some packages that we will use, and set some variables.

Note: Full list of valid URLs is available from http://www.ebi.ac.uk/pdbe/api/doc/


In [12]:
# Query the PDBe API
import re
import requests

base_url = "https://www.ebi.ac.uk/pdbe/"

api_base = base_url + "api/"

summary_url = api_base + 'pdb/entry/summary/'

binding_url = api_base + 'pdb/entry/binding_sites/'
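
The per-entry URLs defined above are prefixes to which a PDB ID is appended. The sketch below shows one way to build such a URL safely. `entry_url` and `PDB_ID_PATTERN` are hypothetical names, and both the four-character ID check and the lower-casing are assumptions made for this example rather than documented API requirements. `1CBS` serves only as an example ID.

```python
import re

API_BASE = "https://www.ebi.ac.uk/pdbe/api/"
SUMMARY_URL = API_BASE + "pdb/entry/summary/"

# Assumption: classic four-character PDB IDs (a digit followed by three alphanumerics)
PDB_ID_PATTERN = re.compile(r"[0-9][A-Za-z0-9]{3}")

def entry_url(prefix: str, pdb_id: str) -> str:
    """Return the per-entry URL for a validated, lower-cased PDB ID."""
    pdb_id = pdb_id.strip()
    if not PDB_ID_PATTERN.fullmatch(pdb_id):
        raise ValueError(f"Not a four-character PDB ID: {pdb_id!r}")
    return prefix + pdb_id.lower()

print(entry_url(SUMMARY_URL, "1CBS"))
```

Validating the ID before the request keeps malformed input from turning into a confusing HTTP error later on.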

The call https://www.ebi.ac.uk/pdbe/api/pdb/compound/cofactors returns a summary of the cofactor annotation in the PDB, grouped by cofactor class.
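
As a rough illustration of how that summary can be consumed, the sketch below flattens a payload into a class-to-codes mapping. The payload shape (class name mapped to a list of records, each with a `"cofactors"` list) is inferred from how the functions in this notebook index the response and should be treated as an assumption. `codes_by_class` is a hypothetical helper, and the sample is hand-written, with codes taken from the output shown later in this notebook.

```python
from typing import Any, Dict, List

def codes_by_class(payload: Dict[str, List[Dict[str, Any]]]) -> Dict[str, List[str]]:
    """Map each cofactor class to the sorted, de-duplicated codes of all its records."""
    result: Dict[str, List[str]] = {}
    for class_name, records in payload.items():
        codes = set()
        for record in records:
            # Records without a "cofactors" key contribute nothing
            codes.update(record.get("cofactors", []))
        result[class_name] = sorted(codes)
    return result

# Hand-written sample in the assumed shape, not a live API response
sample = {
    "Ascorbic acid": [{"cofactors": ["ASC"]}],
    "Factor F430": [{"cofactors": ["F43", "M43"]}],
}
print(codes_by_class(sample))
```

Working on the whole payload at once avoids one HTTP request per class, since the endpoint already returns every class in a single response.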


In [14]:
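def get_cofactor_information(cof_api=None):
    """Fetch the cofactor annotation for all classes from the PDBe API.

    `cof_api` is unused: the endpoint always returns every class at once.
    """
    URL_base = "https://www.ebi.ac.uk/pdbe/api/pdb/compound/cofactors"
    response = requests.get(URL_base, timeout=REQUEST_TIMEOUT)
    if response.status_code == 200:
        return response.json()
    print("No data available")
    return None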

In [15]:
# List the cofactor class names known to the PDBe API.
def retrieve_cofactors_PDB(cof_api=None):
    response = get_cofactor_information(cof_api)
    return list(response.keys()) if response else []
    

In [16]:
def save_cofactor_information(cofactorclass):
    response = get_cofactor_information(cofactorclass)
    for ci in response[cofactorclass]:
        print(cofactorclass, ci["cofactors"])
    return None


In [17]:
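def save_cofactor_information1(cofactorclass):
    """Return [class_name, [CCD codes...]] from the first record of the class."""
    response = get_cofactor_information(cofactorclass) or {}
    for ci in response.get(cofactorclass, []):
        return [cofactorclass, ci["cofactors"]]
    return [cofactorclass, []]  # unknown class or empty record list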

In [19]:
cof_list = [ "Ascorbic acid", "Factor F430", "MIO", "Phosphopantetheine", "Nicotinamide-adenine dinucleotide", "Dipyrromethane",  "Molybdopterin", "Adenosylcobalamin", "Flavin adenine dinucleotide", "Tetrahydrofolic acid", "Coenzyme A", "Coenzyme B", "Flavin Mononucleotide", "Menaquinone",  "Coenzyme M", "Heme", "Biopterin", "Pyrroloquinoline Quinone", "Biotin", "Lipoic acid", "Ubiquinone", "Glutathione", "Orthoquinone residues (LTQ, TTQ, CTQ)", "S-adenosylmethionine",  "Thiamine diphosphate", "Pyridoxal 5'-phosphate", "Topaquinone"]

In [20]:
# Retrieve the chemical component (CCD) codes of each coenzyme class from the PDB Chemical Component Dictionary
for cof in cof_list:
    print(save_cofactor_information1(cof))

['Ascorbic acid', ['ASC']]
['Factor F430', ['F43', 'M43']]
['MIO', ['MDO']]
['Phosphopantetheine', ['PNS']]
['Nicotinamide-adenine dinucleotide', ['0WD', '1DG', '3AA', '3CD', '6V0', '8ID', 'A3D', 'AP0', 'CND', 'DG1', 'DN4', 'EAD', 'ENA', 'LNC', 'N01', 'NA0', 'NAD', 'NAE', 'NAI', 'NAJ', 'NAP', 'NAQ', 'NAX', 'NBD', 'NBP', 'NDC', 'NDE', 'NDO', 'NDP', 'NHD', 'NPW', 'ODP', 'P1H', 'PAD', 'SAD', 'SAE', 'SND', 'TAD', 'TAP', 'TDT', 'TXD', 'TXE', 'TXP', 'ZID']]
['Dipyrromethane', ['18W', '29P', 'DPM']]
['Molybdopterin', ['2MD', 'MCN', 'MGD', 'MSS', 'MTE', 'MTQ', 'MTV', 'PCD', 'XAX']]
['Adenosylcobalamin', ['B12', 'CNC', 'COB', 'COY']]
['Flavin adenine dinucleotide', ['6FA', 'FA8', 'FAA', 'FAB', 'FAD', 'FAE', 'FAO', 'FAS', 'FCG', 'FDA', 'FED', 'FSH', 'P5F', 'RFL', 'SFD']]
['Tetrahydrofolic acid', ['1YJ', 'C2F', 'FFO', 'FON', 'FOZ', 'THF', 'THG', 'THH']]
['Coenzyme A', ['01A', '01K', '0ET', '1C4', '1CV', '1CZ', '1HA', '1VU', '1XE', '2CP', '2NE', '3CP', '3H9', '3HC', '4CA', '4CO', '8JD', '8Z2', 'AC
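
The printed lists are convenient to read but awkward to join with other data. The sketch below turns a class-to-codes mapping into tidy rows (one row per class and code) and serializes them as CSV using only the standard library. `to_tidy_rows`, `rows_to_csv`, and the column names are hypothetical choices made for this example; the sample codes are copied from the output above.

```python
import csv
import io
from typing import Dict, List

FIELDS = ["cofactor_class", "ccd_code"]  # hypothetical column names

def to_tidy_rows(mapping: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """One row per (class, code) pair."""
    return [
        {"cofactor_class": class_name, "ccd_code": code}
        for class_name, codes in mapping.items()
        for code in codes
    ]

def rows_to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

sample = {"MIO": ["MDO"], "Dipyrromethane": ["18W", "29P", "DPM"]}
print(rows_to_csv(to_tidy_rows(sample)))
```

In the notebook itself, the same rows could be passed to a DataFrame and written with the `save_csv` utility defined earlier.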

## Loops for retrieving PDB codes per coenzyme class

In [22]:
def get_PDB_entries_associated_cofactor(cofId):
    URL_base = "https://www.ebi.ac.uk/pdbe/api/pdb/compound/in_pdb"
    query = URL_base + "/" + cofId
    response = requests.get(query)
    if response.status_code == 200:
        return response.json()
    else:
        print("No data available")
        return None

In [23]:
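def save_PDB_cofactors1(cofId, filename):
    """Append the PDB entries containing chemical component `cofId` to `filename`."""
    response = get_PDB_entries_associated_cofactor(cofId)
    if not response or cofId not in response:
        return None  # nothing to write for this code
    with open(filename, "a") as file:
        # Unpack the list so that `sep` yields one comma-separated line per code
        print(*response[cofId], sep=",", file=file)
    return None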
    

In [None]:
# Run the PDBe export for every coenzyme class.
# `save_cofactor_information1(cof)` returns [class_name, [CCD codes...]].
# `save_PDB_cofactors1(cofId, filename)` appends the PDB entries for one CCD code to a file.
# Output: one CSV-like file per coenzyme class under OUTPUT_DIR / "cofactors_by_class".
from pathlib import Path
import re

OUT_DIR_CLASSES = OUTPUT_DIR / "cofactors_by_class"
OUT_DIR_CLASSES.mkdir(parents=True, exist_ok=True)

# Use `cof_list` if it was defined earlier; otherwise fall back to this default.
# Names must match the PDBe class names exactly (see the full list above).
try:
    cof_list
except NameError:
    cof_list = [
        "Flavin Mononucleotide",
        "Flavin adenine dinucleotide",
        "Nicotinamide-adenine dinucleotide",
        "Coenzyme A"
    ]

def _slugify(name: str) -> str:
    return re.sub(r'[^A-Za-z0-9]+', '_', name.strip()).strip('_')

for cof_class in cof_list:
    cls_name, ccd_codes = save_cofactor_information1(cof_class)
    if not ccd_codes:
        print(f"[WARN] No CCD codes found for class: {cof_class}")
        continue
    out_path = OUT_DIR_CLASSES / f"{_slugify(cof_class)}.csv"
    # Start fresh file and add a lightweight header/comment
    with open(out_path, "w") as f:
        f.write("# PDB codes per coenzyme class\n")
        f.write("pdb_entries_for_all_codes_by_class\n")
    # Append PDB entries list for each code
    for code in ccd_codes:
        try:
            save_PDB_cofactors1(code, out_path)  # appends entries for this code
        except Exception as e:
            print(f"[ERROR] {cof_class} / {code}: {e}")
    print(f"[OK] Wrote: {out_path}")
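
The loop above writes one line per CCD code, so a structure that contains two codes from the same class could appear more than once in a class file (an assumption, not checked against the API here). The sketch below merges the entries of several codes into one sorted, de-duplicated list. `merge_entries` is a hypothetical helper, and the IDs in the sample are placeholders, not real query results.

```python
from typing import Dict, Iterable, List

def merge_entries(entries_by_code: Dict[str, Iterable[str]]) -> List[str]:
    """Return the sorted union of PDB IDs over all codes of one class."""
    merged = set()
    for entries in entries_by_code.values():
        # Normalize case so that "2XYZ" and "2xyz" count as the same entry
        merged.update(pdb_id.lower() for pdb_id in entries)
    return sorted(merged)

# Placeholder IDs for illustration only
sample = {"FAD": ["1abc", "2xyz"], "FDA": ["2XYZ", "3def"]}
print(merge_entries(sample))  # ['1abc', '2xyz', '3def']
```

Counting the merged list gives the number of distinct structures per class, which the raw per-code lines would overestimate if entries repeat.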


## Include "ATP" as an additional coenzyme class

In [26]:
save_PDB_cofactors1("ATP", OUTPUT_DIR / "ATP_codes.csv")