
# Exploring Molecules with Chemical Databases  
*Session 3: Integrating PubChem Queries*




## 🧪 Introduction

In this session, you'll use **chemical databases** to fetch molecular information programmatically.

- **PubChem** (NIH): open database of chemical molecules and their properties (identifiers, SMILES, molecular weight, logP, etc.).  


**What you'll do**
- Run a **demo** query for *aspirin* (or *caffeine*) against **PubChem**.  
- See key properties:
  - **SMILES**, **Molecular Weight (MW)**, **logP** (a proxy for hydrophobicity).  
- Then try your **own molecule** and render results in a clean table.  
- Guiding question: **“Based on logP, which molecule is more hydrophobic?”**
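
For intuition about the guiding question: logP is the base-10 logarithm of a compound's octanol/water partition ratio, so each unit of logP corresponds to a tenfold shift toward the oily phase. The sketch below only illustrates that arithmetic. The function names and the two logP values are made-up placeholders, not PubChem data.

```python
def partition_ratio(logp: float) -> float:
    """Convert logP back to the octanol/water concentration ratio P = 10**logP."""
    return 10.0 ** logp

def more_hydrophobic(logp_by_name: dict) -> str:
    """Return the name with the highest logP (hypothetical helper for this sketch)."""
    return max(logp_by_name, key=logp_by_name.get)

# Placeholder values chosen for illustration only.
example = {"molecule_a": 2.0, "molecule_b": -0.5}
for name, logp in example.items():
    print(f"{name}: logP={logp:+.1f}, octanol/water ratio ~ {partition_ratio(logp):.2f}")
print("More hydrophobic:", more_hydrophobic(example))
```

A negative logP means the compound prefers water over octanol, which is why the sign matters as much as the magnitude when comparing two molecules.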



## ✅ Setup (Run this once)
- Install small dependencies
- Import helper libraries


In [None]:
# We use 'requests' to call PubChem's web API (PUG REST) and 'pandas' to show tidy tables.
# Widgets are for simple UI controls (text boxes, buttons) in the notebook.
!pip -q install ipywidgets pandas requests

import json
import ipywidgets as widgets
import pandas as pd
import requests
from IPython.display import display, Markdown



## 🧩 Helper Functions (What these do)
- `pubchem_props_by_name(name)` → uses **PubChem PUG REST** to get **CID** and properties (SMILES, MW, XLogP).  
- `format_props_table(records)` → turns results into a **pandas DataFrame** for a clean table view.  
- If the web call fails (no internet / bad name), you get a **friendly message** instead of a crash.
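
The two-step lookup (name to CID, then CID to properties) comes down to building two URLs. Here is a minimal, offline sketch of just that step. `build_cid_url` and `build_property_url` are hypothetical names used for illustration, and `urllib.parse.quote` stands in for `requests.utils.quote` so the snippet needs only the standard library.

```python
from urllib.parse import quote

PUBCHEM_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def build_cid_url(name: str) -> str:
    """Step 1 URL: resolve a compound name to its CID list."""
    # safe="" percent-encodes every reserved character, including "/".
    return f"{PUBCHEM_BASE}/compound/name/{quote(name, safe='')}/cids/JSON"

def build_property_url(cid: int, properties: list) -> str:
    """Step 2 URL: fetch the requested properties for one CID."""
    return f"{PUBCHEM_BASE}/compound/cid/{cid}/property/{','.join(properties)}/JSON"

print(build_cid_url("acetic acid"))
print(build_property_url(2244, ["MolecularWeight", "XLogP"]))
```

Percent-encoding the name matters for inputs that contain spaces or slashes, which would otherwise change the path PubChem sees.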


In [None]:
# PubChem PUG REST root. We append specific paths for name→CID and CID→properties calls.
PUBCHEM_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def pubchem_props_by_name(name: str):
   """
    Chemistry context:
      - PubChem stores molecules under a CID (Compound ID).
      - We first resolve the common name (e.g., "aspirin") to a CID.
      - Then, we fetch key properties: SMILES (a text encoding of structure),
        Molecular Weight (g/mol), and XLogP (estimate of lipophilicity; higher ⇒ more hydrophobic).

    Returns a dict for a single molecule with:
      input (the name you typed), cid, smiles, mw (g/mol), logP (unitless),
      or an 'error' message if PubChem can't find it.
    """
    try:
         # 1) Resolve common name → CID (Compound ID).
        #    Example: /compound/name/aspirin/cids/JSON  → gives back the CID (e.g., 2244)
        cid_url = f"{PUBCHEM_BASE}/compound/name/{requests.utils.quote(name)}/cids/JSON"
        r = requests.get(cid_url, timeout=20, headers={"Accept": "application/json"})
        r.raise_for_status()
        data = r.json().get("IdentifierList", {}).get("CID", [])
        if not data:
            return {"input": name, "error": "No CID found for this name."}
        cid = data[0]

        # 2) CID -> properties
        prop_url = (f"{PUBCHEM_BASE}/compound/cid/{cid}/property/"
                f"IsomericSMILES,CanonicalSMILES,MolecularWeight,XLogP/JSON")
        r2 = requests.get(prop_url, timeout=20, headers={"Accept": "application/json"})
        r2.raise_for_status()
        props = r2.json().get("PropertyTable", {}).get("Properties", [])
        if not props:
            return {"input": name, "cid": cid, "error": "No properties returned for this CID."}
        p = props[0]

        # Choose the first available SMILES string. PubChem responses may label
        # these fields "SMILES" / "ConnectivitySMILES", so check both spellings.
        smiles = (p.get("IsomericSMILES") or p.get("SMILES")
                  or p.get("CanonicalSMILES") or p.get("ConnectivitySMILES"))
        if not smiles:
            print("DEBUG: no SMILES field found in the PubChem response.")
        return {
            "input": name,
            "cid": cid,
            "smiles": smiles,
            "mw": p.get("MolecularWeight"),
            "logP": p.get("XLogP"),
        }

    except Exception as e:
        return {"input": name, "error": f"Request failed: {e}"}

def format_props_table(records):
    """Build a DataFrame with one row per record; keys absent from a record become empty strings."""
    rows = []
    for rec in records:
        rows.append({
            "Input": rec.get("input"),
            "CID": rec.get("cid", ""),
            "SMILES": rec.get("smiles", ""),
            "Molecular Weight": rec.get("mw", ""),
            "logP (XLogP)": rec.get("logP", ""),
            "Error": rec.get("error", "")
        })
    df = pd.DataFrame(rows, columns=["Input","CID","SMILES","Molecular Weight","logP (XLogP)","Error"])
    return df
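
One refinement to consider: `raise_for_status()` turns any HTTP error into an exception, so the "friendly message" the student sees is the raw exception text. The sketch below maps a few status codes to plainer wording. `friendly_http_message` is a hypothetical helper, and the meanings attached to each code are assumptions drawn from the PUG REST status-code documentation, so verify them before relying on them.

```python
# Hypothetical helper: plain-language wording for common HTTP status codes.
# The meanings below are assumptions based on the PUG REST status-code table.
FRIENDLY_MESSAGES = {
    400: "PubChem could not understand the request (check the spelling).",
    404: "PubChem has no record matching this input.",
    503: "PubChem is busy right now; wait a moment and try again.",
    504: "PubChem took too long to answer; try again later.",
}

def friendly_http_message(status_code: int) -> str:
    """Return a student-friendly message for an HTTP status code."""
    if 200 <= status_code < 300:
        return "OK"
    return FRIENDLY_MESSAGES.get(status_code, f"Unexpected HTTP status {status_code}.")

for code in (200, 404, 503, 418):
    print(code, "->", friendly_http_message(code))
```

Inside `pubchem_props_by_name`, a helper like this could be called with `r.status_code` before `r.raise_for_status()` to return a clearer `error` entry.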



## 💡 Demo — Query a known compound (Aspirin)

Click **Fetch** to query **PubChem** for *aspirin* or switch to *caffeine* using the dropdown.  
We’ll display **SMILES**, **MW**, **logP**, and a short explanation.


In [None]:
# Chemist-facing UI:
# - Pick aspirin/caffeine and click Fetch.
# - Under the hood: name → CID → properties from PubChem.
# - We then display a tidy table and a short chemistry summary (SMILES, MW, logP).

demo_pick = widgets.Dropdown(
    options=[("aspirin","aspirin"), ("caffeine","caffeine")],
    value="aspirin", description="Compound:"
)
demo_btn = widgets.Button(description="Fetch", button_style="success")
demo_out = widgets.Output()

def on_demo(_):
    demo_out.clear_output()
    name = demo_pick.value.strip()
    with demo_out:
        display(Markdown(f"⏳ Querying PubChem for **{name}** ..."))
    rec = pubchem_props_by_name(name)
    df = format_props_table([rec])
    demo_out.clear_output()
    with demo_out:
        display(Markdown("### Results (PubChem)"))
        display(df)
        if not rec.get("error"):
            lp = rec.get("logP")
            extra = ""
            if lp is not None and lp != "":
                extra = " A higher logP typically indicates a **more hydrophobic** molecule."
            display(Markdown(
                f"**Plain-English explanation:**\n\n"
                f"- **SMILES** encodes structure as text.\n"
                f"- **Molecular Weight** is the mass of one mole of the compound (g/mol).\n"
                f"- **logP (XLogP)** estimates lipophilicity (hydrophobicity).{extra}"
            ))
        else:
            display(Markdown(f"**Note:** {rec['error']}"))

demo_btn.on_click(on_demo)
display(widgets.HBox([demo_pick, demo_btn]), demo_out)


Normally, you’ll just look at the clean results table (SMILES, MW, logP).
But sometimes values may look missing, or you might be curious about what PubChem actually sends back.
Below is an optional toggle: if you check it, you’ll see the raw JSON straight from the PubChem API. This is exactly what our code is parsing behind the scenes.

In [None]:
import ipywidgets as widgets
from IPython.display import display, Markdown
import json, requests

# Checkbox for showing raw JSON
raw_toggle = widgets.Checkbox(value=False, description="Show raw JSON")
raw_out = widgets.Output()
display(raw_toggle, raw_out)

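# Example: request aspirin (CID 2244) properties directly, before wiring up the toggle.
cid = 2244
url = f"{PUBCHEM_BASE}/compound/cid/{cid}/property/IsomericSMILES,CanonicalSMILES,SMILES,ConnectivitySMILES,MolecularWeight,XLogP/JSON"
try:
    r = requests.get(url, timeout=20)
    r.raise_for_status()
    resp = r.json()
except Exception as e:
    # Keep the notebook usable offline: store the failure instead of crashing the cell.
    resp = {"error": f"Request failed: {e}"}

def show_raw_json(change=None):
    """Print the stored PubChem response when the checkbox is ticked; clear it otherwise."""
    raw_out.clear_output()
    if raw_toggle.value:
        with raw_out:
            display(Markdown("**Raw JSON output from PubChem API:**"))
            print(json.dumps(resp, indent=2)[:2000])  # truncated for readability

# Connect the toggle to the function
raw_toggle.observe(show_raw_json, names='value')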
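
The payload printed by the toggle is nested: the helper reads `PropertyTable`, then `Properties`, then the first entry. Below is a sketch of that parsing step on a hand-written sample, so it runs without a network connection. The sample imitates the shape the code above expects and is not a captured API response. `extract_props` and `SMILES_KEYS` are hypothetical names.

```python
import json

# Hand-written sample imitating the response shape; not a captured API reply.
SAMPLE = json.loads("""
{"PropertyTable": {"Properties": [
  {"CID": 2244, "SMILES": "CC(=O)OC1=CC=CC=C1C(=O)O", "XLogP": 1.2}
]}}
""")

SMILES_KEYS = ("IsomericSMILES", "SMILES", "CanonicalSMILES", "ConnectivitySMILES")

def extract_props(payload: dict) -> dict:
    """Pull CID, the first available SMILES field, and XLogP out of a payload."""
    props = payload.get("PropertyTable", {}).get("Properties", [])
    if not props:
        return {"error": "No properties in payload."}
    p = props[0]
    # Take the first key that is present and non-empty.
    smiles = next((p[k] for k in SMILES_KEYS if p.get(k)), None)
    return {"cid": p.get("CID"), "smiles": smiles, "logP": p.get("XLogP")}

print(extract_props(SAMPLE))
print(extract_props({}))
```

Chaining `.get(..., {})` keeps the lookup from raising when a level is missing, which is the same defensive pattern the helper function uses.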


## ✍️ Exercise — Your Molecule(s)

- Type the **name** of a molecule (e.g., *ibuprofen*, *nicotine*, *paracetamol*).  
- Optionally type a **second** molecule to compare **logP** (hydrophobicity).  
- Click **Run** to query **PubChem** and show a **table** of properties.

**Guiding task:** “Based on logP, which molecule is more hydrophobic?”


In [None]:

mol1 = widgets.Text(placeholder="e.g., ibuprofen", description="Mol A:")
mol2 = widgets.Text(placeholder="(optional) e.g., paracetamol", description="Mol B:")
run_btn = widgets.Button(description="Run", button_style="primary")
ex_out = widgets.Output()

def on_run(_):
    ex_out.clear_output()
    names = [n.strip() for n in [mol1.value, mol2.value] if n.strip()]
    if not names:
        with ex_out:
            display(Markdown("*Please type at least one molecule name.*"))
        return

    with ex_out:
        display(Markdown("⏳ Querying PubChem ..."))

    # Call PubChem for each input in turn (no parallelization, for clarity).
    recs = [pubchem_props_by_name(n) for n in names]
    df = format_props_table(recs)

    ex_out.clear_output()
    with ex_out:
        display(Markdown("### Your Results"))
        display(df)
        # Only compare molecules that came back with a numeric logP.
        valid = [r for r in recs if not r.get("error") and isinstance(r.get("logP"), (int, float))]
        if len(valid) >= 2:
            top = max(valid, key=lambda r: r["logP"])
            display(Markdown(
                f"**Hydrophobicity hint:** Based on **logP**, **{top['input']}** appears **more hydrophobic** "
                f"(logP ≈ {top['logP']})."
            ))
        elif len(names) >= 2:
            display(Markdown("*Could not compare: at least one molecule has no numeric logP.*"))


run_btn.on_click(on_run)
display(mol1, mol2, run_btn, ex_out)
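
The comparison step can also be tried without the widgets. This sketch ranks result dicts by logP and sets aside the ones that cannot be compared. `rank_by_logp` is a hypothetical helper, and the records are invented placeholders in the same shape that `pubchem_props_by_name` returns.

```python
def rank_by_logp(records):
    """Return (ranked, skipped): ranked is sorted from most to least hydrophobic."""
    usable, skipped = [], []
    for rec in records:
        lp = rec.get("logP")
        # bool is a subclass of int, so exclude it explicitly.
        if not rec.get("error") and isinstance(lp, (int, float)) and not isinstance(lp, bool):
            usable.append(rec)
        else:
            skipped.append(rec)
    ranked = sorted(usable, key=lambda r: r["logP"], reverse=True)
    return ranked, skipped

# Hypothetical records; the logP values are placeholders, not PubChem data.
records = [
    {"input": "mol_a", "logP": 0.5},
    {"input": "mol_b", "logP": 3.1},
    {"input": "mol_c", "error": "No CID found for this name."},
    {"input": "mol_d", "logP": None},
]
ranked, skipped = rank_by_logp(records)
print("Ranked:", [r["input"] for r in ranked])
print("Skipped:", [r["input"] for r in skipped])
```

Returning the skipped records separately makes it easy to tell the student which inputs were left out of the comparison and why.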



## 📘 Reflection
- Did the **logP** values align with your expectations of hydrophobicity?  
- Were any names ambiguous or missing? How would you disambiguate (IUPAC name, CID, or structure)?  
- How might you extend this to **batch queries** for a set of candidate molecules?
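
One possible direction for the batch question: PUG REST accepts a comma-separated list of CIDs in a single property request, so a set of candidates can be fetched in a few calls instead of one call per molecule. The sketch covers the URL-building side only, with no network access. `chunked` and `batch_property_url` are hypothetical helpers, and the chunk size of 100 is an arbitrary choice rather than a documented limit.

```python
PUBCHEM_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def batch_property_url(cids, properties=("MolecularWeight", "XLogP")):
    """Build one property URL covering several CIDs at once."""
    cid_part = ",".join(str(c) for c in cids)
    return f"{PUBCHEM_BASE}/compound/cid/{cid_part}/property/{','.join(properties)}/JSON"

candidate_cids = [2244, 2519]  # aspirin and caffeine
for group in chunked(candidate_cids, 100):
    print(batch_property_url(group))
```

Each URL could then be fetched with `requests.get(url, timeout=20)`; pausing briefly between calls helps stay within PubChem's request-rate guidelines.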
