feat(scripts): PubChem enricher for small-molecule scalars (#165) #204

Merged: gerchowl merged 2 commits into main from feature/165-pubchem on May 6, 2026

gerchowl commented May 6, 2026

Summary

  • New scripts/enrich_from_pubchem.py: a cross-check enricher for pure compounds in gases.toml / liquids.toml. PubChem (NCBI/NLM, PD-USGov) is the cleanest source for small-molecule MW and MP for elements/gases that NIST WebBook doesn't surface; NIST WebBook (#159, PR #203) stays the primary populator for fluid thermophysical properties.
  • Parses PUG-REST (/property/Title,MolecularFormula,MolecularWeight/JSON) for MW + canonical title and PUG-View (/data/compound/<CID>/JSON?heading=...) for the experimental-section Density / Melting Point / Boiling Point strings.
  • Writeback target: thermal.melting_point (clean °C from PubChem). mechanical.density is gated — comparison only for gases (PubChem reports the cryo/critical density, not STP), add-only writeback for liquids.
  • MW and boiling point (in K) are surfaced in the report and folded into the _sources note, but not written to TOML, because the runtime schema lacks molecular_weight_g_mol and thermal.boiling_point. Documented as a follow-up; the parser already collects them, so a future schema PR only needs new PropertySpec lines.
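
As an illustration of the two endpoints named above, here is a minimal sketch of building the URLs and walking the responses. The function names (`property_url`, `view_url`, `parse_properties`, `heading_strings`) are hypothetical and not taken from `scripts/enrich_from_pubchem.py`; the sample payloads are hand-trimmed to the shape PubChem serves rather than captured responses, and CID 962 is assumed to be water.

```python
from urllib.parse import quote

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest"


def property_url(cid: int) -> str:
    """PUG-REST URL for the three scalar properties named in the summary."""
    props = "Title,MolecularFormula,MolecularWeight"
    return f"{PUG_BASE}/pug/compound/cid/{cid}/property/{props}/JSON"


def view_url(cid: int, heading: str) -> str:
    """PUG-View URL restricted to one heading, e.g. "Melting Point"."""
    return f"{PUG_BASE}/pug_view/data/compound/{cid}/JSON?heading={quote(heading)}"


def parse_properties(payload: dict) -> dict:
    """Pull title, formula and MW out of a PUG-REST PropertyTable payload."""
    row = payload["PropertyTable"]["Properties"][0]
    return {
        "title": row["Title"],
        "formula": row["MolecularFormula"],
        # MW may arrive as a string; float() accepts both str and numbers.
        "mw_g_mol": float(row["MolecularWeight"]),
    }


def heading_strings(node, heading: str) -> list:
    """Recursively collect the free-text strings under a given TOCHeading."""
    found = []
    if isinstance(node, dict):
        if node.get("TOCHeading") == heading:
            for info in node.get("Information", []):
                for part in info.get("Value", {}).get("StringWithMarkup", []):
                    found.append(part["String"])
        for child in node.values():
            found.extend(heading_strings(child, heading))
    elif isinstance(node, list):
        for child in node:
            found.extend(heading_strings(child, heading))
    return found


# Hand-trimmed sample payloads (assumed shape, not captured responses).
rest_sample = {"PropertyTable": {"Properties": [
    {"CID": 962, "MolecularFormula": "H2O",
     "MolecularWeight": "18.015", "Title": "Water"}]}}
view_sample = {"Record": {"RecordNumber": 962, "Section": [
    {"TOCHeading": "Chemical and Physical Properties", "Section": [
        {"TOCHeading": "Experimental Properties", "Section": [
            {"TOCHeading": "Density", "Information": [
                {"Value": {"StringWithMarkup": [
                    {"String": "0.9950 g/cu cm at 25 °C"}]}}]}]}]}]}}

print(parse_properties(rest_sample))
print(heading_strings(view_sample, "Density"))
```

The recursive walk avoids hard-coding the section nesting depth, which keeps the sketch tolerant of records that place the heading at a different level.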

Judgment calls

  • Density string parser: PubChem Density strings are a mix of clean values ("0.9950 g/cu cm at 25 °C"), wrong-unit values ("1.251 g/L"), table refs ("[Table#8152]"), and prose. Per the issue scope guard, the regex is conservative: only <float> g/cu cm and <float> g/cm3 patterns match.
  • Gas density gating: every gas entry has density_writeback=False, so even when PubChem returns a parseable density it never overwrites our STP value (it's the wrong reference state). Comparison still runs and reports the DIFF.
  • CID coverage: water, glycerol (liquids); nitrogen, argon, helium, co2, methane (gases). Ethanol excluded (no liquids.ethanol yet); air excluded (mixture, not a pure compound). All CIDs verified live via Title field on 2026-05-07.
  • Tolerance: 5% relative diff for DIFF flag (looser than NIST WebBook's 2%) — PubChem aggregates many primary sources at varying reference temperatures.
  • License tag: PD-USGov. Verified via existing LICENSE_ALLOWLIST canary test.
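
The conservative density match described above could look roughly like the following. This is a sketch under assumptions, not the regex the script ships: the pattern, the `parse_density_g_cm3` name, and the case-insensitive flag are all illustrative.

```python
import re
from typing import Optional

# Accept only "<float> g/cu cm" or "<float> g/cm3"; everything else is rejected.
# The lookbehind stops the match from starting in the middle of a longer token.
_DENSITY_RE = re.compile(
    r"(?<![\w.])(\d+(?:\.\d+)?)\s*g/(?:cu cm|cm3)\b",
    re.IGNORECASE,
)


def parse_density_g_cm3(text: str) -> Optional[float]:
    """Return the density in g/cm^3, or None when the string is not clean."""
    match = _DENSITY_RE.search(text)
    return float(match.group(1)) if match else None


for sample in (
    "0.9950 g/cu cm at 25 °C",
    "1.251 g/L",
    "[Table#8152]",
    "VAPOR DENSITY @ NORMAL TEMP APPROX SAME AS AIR",
    "Critical density: 0.311 g/cu cm",
):
    print(repr(sample), "->", parse_density_g_cm3(sample))
```

Note that the last sample parses successfully, which is why a regex alone cannot protect gas entries and a separate writeback gate is needed.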

Live comparison (PubChem vs ours, full run)

| material | density (g/cm³) | melting_point | MW (g/mol) | BP |
|---|---|---|---|---|
| water | OK ours=0.998 wd=0.995 (0.3%) | OK 0 °C | 18.015 | 373.1 K |
| glycerol | OK ours=1.261 wd=1.261 (0.0%) | OK 18.1 °C (0.6%) | 92.09 | 563.1 K |
| nitrogen | DIFF ours=0.001165 wd=0.311 (STP vs critical) | (missing → would-write) | 28.014 | 77.36 K |
| argon | (no value) | (missing → would-write -189.4 °C) | 39.9 | 87.3 K |
| helium | (no value) | (missing → would-write -272.2 °C) | 4.0026 | 4.222 K |
| co2 | (no value) | (missing → would-write -56.56 °C) | 44.009 | 194.7 K |
| methane | (no value) | (missing → would-write -182.6 °C) | 16.043 | 111.6 K |

The nitrogen density DIFF is the gating canary — PubChem's "Critical density: 0.311 g/cu cm" matches our regex, but density_writeback=False blocks the write. Working as designed.
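
The gate can be pictured as a single predicate combining the per-entry flag with the add-only rule. The sketch below is an assumption about the shape of that logic: `Entry` and `should_write_density` are hypothetical names, and only the `density_writeback` flag comes from the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical per-compound config; only density_writeback is from the PR."""
    key: str
    density_writeback: bool


def should_write_density(entry: Entry, ours: Optional[float],
                         theirs: Optional[float]) -> bool:
    """Write only when PubChem has a value, the gate is open, and we have none."""
    if theirs is None:
        return False          # nothing parseable came back
    if not entry.density_writeback:
        return False          # gas entries: comparison only
    return ours is None       # add-only: never overwrite an existing value


nitrogen = Entry("nitrogen", density_writeback=False)
water = Entry("water", density_writeback=True)

print(should_write_density(nitrogen, 0.001165, 0.311))  # gated gas
print(should_write_density(water, 0.998, 0.995))        # existing value kept
print(should_write_density(water, None, 0.995))         # missing value filled
```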

Test plan

  • pytest tests/test_pubchem_enrichment.py tests/test_curation_helpers.py -v — 23 + 20 = 43 passed
  • pytest (full suite) — 671 passed, 35 skipped (visual regression / Playwright unavailable)
  • ruff check . && ruff format --check . — clean
  • Live python scripts/enrich_from_pubchem.py --key water --dry-run — works against the live PubChem API
  • Live full python scripts/enrich_from_pubchem.py --dry-run — all 7 CIDs return useful data; gating prevents the gas-density mismatch from contaminating writes
  • Round-trip preserves comments (test_write_adds_missing_melting_point_and_sources_row)
  • No-overwrite invariant (test_write_does_not_overwrite_existing_value)
  • Density-writeback gate works for gases (test_density_not_written_for_gas_entry)

Closes #165.

Adds `scripts/enrich_from_pubchem.py` — a cross-check enricher for pure
compounds in the gases & liquids catalogs. PubChem (NCBI/NLM, PD-USGov)
is the lane this fills: per-compound MW, melting point, boiling point,
and (parseable-when-clean) density. NIST WebBook (#159, PR #203) stays
the primary populator for fluid thermophysical properties; PubChem is
only the sanity check + the source of MP for elements/gases that NIST
WebBook doesn't cleanly surface.

## Properties enriched
- `thermal.melting_point` (°C → degC) — primary writeback target. Clean
  for every CID we ship.
- `mechanical.density` (g/cm³) — comparison only for gases; add-only
  writeback for liquids.

## Properties dropped (schema gap, documented)
- `molecular_weight_g_mol` — not in `MechanicalProperties` today.
- `thermal.boiling_point` — not in `ThermalProperties` today.

Both are still surfaced in the report and folded into the `_sources`
note (boiling point in K) so curators see what PubChem said. Adding the
schema fields is out of scope for #165.

## Judgment calls
- **Density string parser**: PubChem `Density` strings are a mix of
  clean values (`"0.9950 g/cu cm at 25 °C"`), wrong-unit values
  (`"1.251 g/L"`), table refs (`"[Table#8152]"`), and prose
  (`"VAPOR DENSITY @ NORMAL TEMP APPROX SAME AS AIR"`). Per the issue
  scope guard, the regex is conservative: only `<float> g/cu cm` and
  `<float> g/cm3` patterns match. Everything else returns None and
  surfaces as `(no value)` in the report.
- **Gas density gating**: PubChem density entries for gases are mostly
  the cryogenic-liquid phase, not the STP gas density we store.
  Comparing always flags DIFF (e.g. nitrogen 0.001165 vs PubChem 0.311 —
  PubChem's critical density). `density_writeback=False` on every gas
  entry blocks the writeback path so we never overwrite an STP value
  with a cryo value. Comparison still runs and reports the DIFF.
- **CID coverage**: water, glycerol (liquids); nitrogen, argon, helium,
  CO2, methane (gases). Ethanol is in the issue's suggested CID list
  but `liquids.ethanol` doesn't exist in the catalog yet, so it's
  excluded; air is a mixture and PubChem isn't a meaningful source.
  Oxygen/hydrogen/neon/xenon left out for the initial scope.
- **Tolerance**: 5% relative diff for DIFF flag (looser than NIST
  WebBook's 2%) — PubChem aggregates many primary sources at varying
  reference temperatures.
- **License tag**: `PD-USGov`. Per NCBI policies PubChem data is freely
  usable including commercially; PD-USGov is the closest fit in our
  allow-list. Verified via the existing `LICENSE_ALLOWLIST` canary test.
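
For reference, the 5% tolerance check can be sketched as below. The PR does not say which value is the denominator or how a stored value of zero is handled, so both choices here (relative to our value, with a zero-safe fallback) are assumptions, as are the function names.

```python
def relative_diff(ours: float, theirs: float) -> float:
    """Relative difference against our stored value (assumed denominator)."""
    if ours == 0:
        # Assumed fallback: identical zeros agree, anything else is unbounded.
        return 0.0 if theirs == 0 else float("inf")
    return abs(ours - theirs) / abs(ours)


def flag(ours: float, theirs: float, tolerance: float = 0.05) -> str:
    """Return "OK" within tolerance, otherwise "DIFF"."""
    return "OK" if relative_diff(ours, theirs) <= tolerance else "DIFF"


# Water density from the live run: 0.998 (ours) vs 0.995 (PubChem).
print(flag(0.998, 0.995), f"{relative_diff(0.998, 0.995):.1%}")
# Nitrogen: STP gas density vs PubChem's critical density.
print(flag(0.001165, 0.311))
```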

## Live results (full run, fresh fetch)
- water: density OK (0.3%), MP OK
- glycerol: density OK (0.0%), MP OK (0.6%)
- nitrogen: density DIFF (expected — STP vs critical), MP missing → would-write
- argon, helium, co2, methane: density (no value), MP missing → would-write

Closes #165.

gerchowl enabled auto-merge (squash) May 6, 2026 22:52

Two CI failures on the captured PubChem PUG-View fixture
(`tests/fixtures/pubchem_water.json`):

* `end-of-file-fixer`: append trailing newline.
* `typos`: rewrote PubChem's `"ANID"` (annotation node ID) to `"AND"`
  inside the JSON response. Add `ANID = "ANID"` to `.typos.toml` so
  the fixture stays byte-identical to what PubChem actually serves —
  rewriting it would silently diverge from real responses.

gerchowl merged commit c964f9a into main May 6, 2026
19 checks passed
Development

Successfully merging this pull request may close these issues.

Integrate PubChem (public domain) — pure-compound chemistry scalars
