feat(scripts): PubChem enricher for small-molecule scalars (#165) #204
Merged
Adds `scripts/enrich_from_pubchem.py`, a cross-check enricher for pure compounds in the gases & liquids catalogs. PubChem (NCBI/NLM, PD-USGov) is the lane this fills: per-compound MW, melting point, boiling point, and (parseable-when-clean) density. NIST WebBook (#159, PR #203) stays the primary populator for fluid thermophysical properties; PubChem is only the sanity check plus the source of MP for elements/gases that NIST WebBook doesn't cleanly surface.

## Properties enriched

- `thermal.melting_point` (°C → degC): primary writeback target. Clean for every CID we ship.
- `mechanical.density` (g/cm³): comparison only for gases; add-only writeback for liquids.

## Properties dropped (schema gap, documented)

- `molecular_weight_g_mol`: not in `MechanicalProperties` today.
- `thermal.boiling_point`: not in `ThermalProperties` today.

Both are still surfaced in the report and folded into the `_sources` note (in K) so curators see what PubChem said. Adding the schema fields is out of scope for #165.

## Judgment calls

- **Density string parser**: PubChem `Density` strings are a mix of clean values (`"0.9950 g/cu cm at 25 °C"`), wrong-unit values (`"1.251 g/L"`), table refs (`"[Table#8152]"`), and prose (`"VAPOR DENSITY @ NORMAL TEMP APPROX SAME AS AIR"`). Per the issue scope guard, the regex is conservative: only `<float> g/cu cm` and `<float> g/cm3` patterns match. Everything else returns None and surfaces as `(no value)` in the report.
- **Gas density gating**: PubChem density entries for gases are mostly the cryogenic-liquid phase, not the STP gas density we store. Comparing always flags DIFF (e.g. nitrogen 0.001165 vs PubChem 0.311, which is PubChem's critical density). `density_writeback=False` on every gas entry blocks the writeback path so we never overwrite an STP value with a cryo value. Comparison still runs and reports the DIFF.
- **CID coverage**: water, glycerol (liquids); nitrogen, argon, helium, CO2, methane (gases). Ethanol is in the issue's suggested CID list but `liquids.ethanol` doesn't exist in the catalog yet, so it's excluded; air is a mixture and PubChem isn't a meaningful source. Oxygen/hydrogen/neon/xenon left out for the initial scope.
- **Tolerance**: 5% relative diff for DIFF flag (looser than NIST WebBook's 2%), because PubChem aggregates many primary sources at varying reference temperatures.
- **License tag**: `PD-USGov`. Per NCBI policies PubChem data is freely usable including commercially; PD-USGov is the closest fit in our allow-list. Verified via the existing `LICENSE_ALLOWLIST` canary test.

## Live results (full run, fresh fetch)

- water: density OK (0.3%), MP OK
- glycerol: density OK (0.0%), MP OK (0.6%)
- nitrogen: density DIFF (expected: STP vs critical), MP missing → would-write
- argon, helium, co2, methane: density (no value), MP missing → would-write

Closes #165.
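The parser, tolerance, and gating calls above fit together roughly as follows. This is a minimal sketch, not the code in `scripts/enrich_from_pubchem.py`: the function names, the regex, and the status labels are assumptions made for illustration. Only the two accepted unit spellings, the 5% tolerance, and the `density_writeback` flag come from the description.

```python
import re
from typing import Optional

# Hypothetical names throughout. Only "g/cu cm" and "g/cm3" are accepted,
# mirroring the conservative scope described above.
DENSITY_RE = re.compile(r"(\d+(?:\.\d+)?)\s*g/(?:cu cm|cm3)\b")
REL_TOLERANCE = 0.05  # 5% relative difference before flagging DIFF


def parse_density_g_cm3(raw: str) -> Optional[float]:
    """Return a density in g/cm3, or None for anything not clearly in that unit."""
    match = DENSITY_RE.search(raw)
    return float(match.group(1)) if match else None


def classify(ours: Optional[float], theirs: Optional[float]) -> str:
    """Compare the catalog value with the PubChem value for the report."""
    if theirs is None:
        return "(no value)"
    if ours is None:
        return "missing"
    if ours == 0:
        return "OK" if theirs == 0 else "DIFF"
    rel_diff = abs(theirs - ours) / abs(ours)
    return "OK" if rel_diff <= REL_TOLERANCE else "DIFF"


def should_write_density(
    ours: Optional[float], theirs: Optional[float], density_writeback: bool
) -> bool:
    """Add-only writeback: never overwrite, and never write when the entry is gated."""
    return density_writeback and ours is None and theirs is not None


# Nitrogen: the critical-density string parses, the comparison flags DIFF,
# and the gate still blocks the write.
n2 = parse_density_g_cm3("Critical density: 0.311 g/cu cm")
print(n2, classify(0.001165, n2), should_write_density(0.001165, n2, False))
```

Using `search` rather than a start-anchored match is what lets a string like `"Critical density: 0.311 g/cu cm"` parse, which is why the per-entry gate is needed in addition to the regex.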
Two CI failures on the captured PubChem PUG-View fixture (`tests/fixtures/pubchem_water.json`):

* `end-of-file-fixer`: append trailing newline.
* `typos`: rewrote PubChem's `"ANID"` (annotation node ID) to `"AND"` inside the JSON response. Add `ANID = "ANID"` to `.typos.toml` so the fixture stays byte-identical to what PubChem actually serves; rewriting it would silently diverge from real responses.
This was referenced May 6, 2026
## Summary

- `scripts/enrich_from_pubchem.py`: cross-check enricher for pure compounds in `gases.toml` / `liquids.toml`. PubChem (NCBI/NLM, `PD-USGov`) is the cleanest source for small-molecule MW + MP for elements/gases that NIST WebBook doesn't surface; NIST WebBook (#159, PR #203) stays the primary populator for fluid thermophysical properties.
- Uses PUG-REST (`…/property/Title,MolecularFormula,MolecularWeight/JSON`) for MW + canonical title and PUG-View (`/data/compound/<CID>/JSON?heading=...`) for the experimental-section Density / Melting Point / Boiling Point strings.
- Writes back `thermal.melting_point` (clean °C from PubChem). `mechanical.density` is gated: comparison only for gases (PubChem reports the cryo/critical density, not STP), add-only writeback for liquids.
- MW and boiling point are surfaced in the report and folded into the `_sources` note in K, but not written to TOML: the runtime schema lacks `molecular_weight_g_mol` and `thermal.boiling_point`. Documented as a follow-up; the parser already collects them, so a future schema PR only needs new `PropertySpec` lines.

## Judgment calls

- **Density string parser**: PubChem `Density` strings are a mix of clean values (`"0.9950 g/cu cm at 25 °C"`), wrong-unit values (`"1.251 g/L"`), table refs (`"[Table#8152]"`), and prose. Per the issue scope guard, the regex is conservative: only `<float> g/cu cm` and `<float> g/cm3` patterns match.
- **Gas density gating**: every gas entry sets `density_writeback=False`, so even when PubChem returns a parseable density it never overwrites our STP value (it's the wrong reference state). Comparison still runs and reports the DIFF.
- **CID coverage**: water, glycerol (liquids); nitrogen, argon, helium, CO2, methane (gases). Ethanol excluded (no `liquids.ethanol` yet); air excluded (mixture, not a pure compound). All CIDs verified live via `Title` field on 2026-05-07.
- **License tag**: `PD-USGov`. Verified via existing `LICENSE_ALLOWLIST` canary test.

## Live comparison (PubChem vs ours, full run)
The `nitrogen` density DIFF is the gating canary: PubChem's "Critical density: 0.311 g/cu cm" matches our regex, but `density_writeback=False` blocks the write. Working as designed.

## Test plan
- `pytest tests/test_pubchem_enrichment.py tests/test_curation_helpers.py -v`: 23 + 20 = 43 passed
- `pytest` (full suite): 671 passed, 35 skipped (visual regression / Playwright unavailable)
- `ruff check . && ruff format --check .`: clean
- `python scripts/enrich_from_pubchem.py --key water --dry-run`: works against the live PubChem API
- `python scripts/enrich_from_pubchem.py --dry-run`: all 7 CIDs return useful data; gating prevents the gas-density mismatch from contaminating writes
- `test_write_adds_missing_melting_point_and_sources_row`
- `test_write_does_not_overwrite_existing_value`
- `test_density_not_written_for_gas_entry`

Closes #165.
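For readers unfamiliar with the two PubChem endpoints named in the summary, the sketch below shows one way to build the request URLs and pull the experimental strings out of a PUG-View response. The base URL and the `Record` / `Section` / `Information` / `StringWithMarkup` nesting follow PubChem's public documentation rather than anything stated in this PR. The function names and the tiny `SAMPLE` payload are hypothetical: `SAMPLE` is hand-written for illustration and is not the captured fixture.

```python
from typing import Any, Dict, Iterator
from urllib.parse import quote

# Assumption: public PubChem REST base URL (not quoted in the PR text).
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest"


def pug_rest_property_url(cid: int) -> str:
    """PUG-REST URL for the title, formula and molecular weight of one CID."""
    props = "Title,MolecularFormula,MolecularWeight"
    return f"{BASE}/pug/compound/cid/{cid}/property/{props}/JSON"


def pug_view_url(cid: int, heading: str) -> str:
    """PUG-View URL restricted to one heading, e.g. 'Melting Point'."""
    return f"{BASE}/pug_view/data/compound/{cid}/JSON?heading={quote(heading)}"


def iter_heading_strings(section: Dict[str, Any], heading: str) -> Iterator[str]:
    """Walk nested sections and yield every plain string under `heading`."""
    if section.get("TOCHeading") == heading:
        for info in section.get("Information", []):
            for item in info.get("Value", {}).get("StringWithMarkup", []):
                yield item["String"]
    for child in section.get("Section", []):
        yield from iter_heading_strings(child, heading)


# Hand-written miniature of a PUG-View payload (illustrative only).
SAMPLE = {
    "Record": {
        "RecordNumber": 962,
        "Section": [
            {
                "TOCHeading": "Experimental Properties",
                "Section": [
                    {
                        "TOCHeading": "Density",
                        "Information": [
                            {"Value": {"StringWithMarkup": [
                                {"String": "0.9950 g/cu cm at 25 °C"}
                            ]}}
                        ],
                    }
                ],
            }
        ],
    }
}

print(pug_view_url(962, "Melting Point"))
print(list(iter_heading_strings(SAMPLE["Record"], "Density")))
```

Keeping URL construction and response walking free of network calls makes both easy to exercise against a captured fixture in tests.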