feat(scripts): extend Wikidata enricher to compounds + new properties (#158) #198
Merged
…#158)

Build on the Phase 4 prep PR (#197) that extracted the shared curation helpers; this is the first Tier-1 Phase 4 implementation.

What changes in `scripts/enrich_from_wikidata.py`:

* **Coverage**: iterates every material in `src/pymat/data/*.toml` via `_curation.load_material_keys`, looks up the QID from `[<material>.sourcing].wikidata` (preferred) or the curator fallback dict, and skips materials without a QID. No new materials introduced; that's Phase 5.
* **New properties**: P2101 melting point and P2054 density (both pre-existing) plus P2102 boiling point, P2068 thermal conductivity, and P2056 heat capacity. P2153 Young's modulus and P2055 resistivity are deferred: both need per-grade resolution the current schema can't express; rationale captured in the module docstring.
* **`--write` flag (NEW)**: conservative add-only writeback. When our value is missing and Wikidata has one, the enricher writes the value plus a paired `_sources` row built via `_curation.build_source_row` (kind=qid, license=CC0, note="Pxxxx via SPARQL <date>"). Existing values are NEVER overwritten; DIFF cases stay advisory and surface in the report only.
* **`--report <path>` flag (NEW)**: redirect the markdown report to a file instead of stdout. `--key` and `--dry-run` preserved.

Heat-capacity unit normalization handles J/(kg·K) and J/(g·K) (the ×1000 case); J/(mol·K) is intentionally skipped because converting needs a molar-mass lookup that's out of scope for this PR.

Boiling point is fetched + reported but never written: the schema has no `boiling_point` field on `ThermalProperties` today, and adding one would touch runtime code (out of scope).

Tests in `tests/test_wikidata_enrichment.py` cover: `--dry-run` output via the bundled fixture, `--write` add-only path (missing-field + sources row attached), `--write` does NOT overwrite divergent values, comment-preserving round-trip at the grade level, comparison-only never mutates files, and `--report` file output. Gated on `pytest.importorskip("requests"/"tomlkit")` matching the `tests/test_curation_helpers.py` pattern.

No runtime changes: `requests`/`tomlkit` stay in `scripts/requirements-curation.txt`. No new materials added; no loader/properties.py touched.

Closes #158
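The add-only rule described in this commit message can be sketched in a few lines. This is a minimal illustration, not the code in `scripts/enrich_from_wikidata.py`: the function name `add_only_writeback`, the `_sources` layout, the `rel_tol` threshold, and the placeholder QID are all assumptions made for the example, and it operates on a plain dict rather than on a `tomlkit` document.

```python
import math
from datetime import date
from typing import Optional


def add_only_writeback(
    section: dict,
    field: str,
    wikidata_value: Optional[float],
    pid: str,
    qid: str,
    today: date,
    rel_tol: float = 0.02,  # assumed threshold, not taken from the PR
) -> str:
    """Return 'SKIP', 'ADDED', 'OK' or 'DIFF'; mutate `section` only for 'ADDED'."""
    if wikidata_value is None:
        return "SKIP"  # nothing to compare against
    ours = section.get(field)
    if ours is None:
        # Missing on our side: write the value plus a paired provenance row.
        section[field] = wikidata_value
        section.setdefault("_sources", {})[field] = {
            "kind": "qid",
            "ref": qid,
            "license": "CC0",
            "note": f"{pid} via SPARQL {today.isoformat()}",
        }
        return "ADDED"
    # An existing value is never overwritten; a divergence is only reported.
    if math.isclose(ours, wikidata_value, rel_tol=rel_tol):
        return "OK"
    return "DIFF"


# Hypothetical section with illustrative numbers (not real material data).
thermal = {"density": 8.0}
status = add_only_writeback(
    thermal, "thermal_conductivity", 400.0, "P2068", "Q_EXAMPLE", date(2026, 5, 6)
)
```

Passing the date in explicitly keeps the sketch deterministic under test; a real script would more likely stamp the current date itself.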
Caught by the pre-commit `typos` hook in CI; local install didn't run hooks because they weren't installed in this worktree.
gerchowl added a commit that referenced this pull request on May 6, 2026
#167) (#200)

Adds scripts/enrich_from_geant4_nist.py, a sibling to enrich_from_wikidata.py (#198) that cross-checks scintillator and plastic entries against the constants shipped in Geant4 v11.2.0's G4NistMaterialBuilder.cc. Uses the shared curation helpers from #197 (load_material_keys, build_source_row, writeback, fmt_delta).

Coverage:
- density (g/cm³, comparison + add-only writeback)
- mean_excitation_energy_eV (Geant4's mean ionisation potential; schema field added in #157), written into [<material>.nuclear]
- composition: skipped, because py-mat has no schema field for element-fraction arrays today; documented in the module docstring

The Geant4 numbers are mirrored by hand into a G4_NIST dict pinned to v11.2.0, with line-number citations into G4NistMaterialBuilder.cc. We do NOT fetch the source at runtime: these constants are stable across minor versions, and curation tooling must be reproducible offline.

CONTROVERSIAL: extends the license allow-list with `Geant4-SL` (Geant4 Software License, BSD-like with attribution). Touches:
- scripts/check_licenses.py:ALLOWED
- scripts/_curation.py:LICENSE_ALLOWLIST
- docs/data-policy.md (new row in Allowed-licenses + clarifying paragraph)
- tests/test_check_licenses.py (parametrized accept-list)

Rationale for picking "extend the allow-list" over "use proprietary-reference-only": the Geant4 SL is materially more permissive than the proprietary-reference label implies (full redistribution allowed with attribution), and distinguishing it explicitly sets the right precedent for the BSD-like sources we'll hit next (HEPData mirrors, third-party NIST compilations). If a reviewer prefers Option B (collapse into proprietary-reference-only), the change is a 3-line revert in _curation/check_licenses + a paragraph deletion in data-policy.md. Calling this out so the decision is reviewable rather than buried.

Materials matched (10): bgo, nai, nai.Tl, csi, csi.Tl, csi.Na, plastic_scint, plastic_scint.BC400, plastic_scint.EJ200, pwo (density only: G4_PbWO4 carries pot=0.0, the compute-from-composition sentinel).

Plastics matched (7): pmma (→ G4_LUCITE), pc, ptfe, pe, nylon, delrin, pctfe.

Skipped with logged reason: lyso, lyso.Ce, labr3, peek, ultem, esr, pla, abs, petg, tpu, vespel, torlon (and their grade-level descendants).

Tests: 10 new in tests/test_geant4_enrichment.py covering --dry-run, --write add-only, no-overwrite-existing, comment-preserving round-trip, license-allowlist canary, source-row sanity, and the PWO-MEE-is-None edge case. Existing curation/Wikidata/check-licenses tests: 44 pass unchanged.

Closes #167
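The PWO edge case above (a `pot=0.0` sentinel that must surface as "no mean excitation energy" rather than as a literal zero) can be sketched as follows. The helper names `mee_from_pot` and `comparable_fields`, and the entry layout with `density` and `pot` keys, are assumptions made for illustration; the numbers in the usage lines are placeholders, not values copied from Geant4.

```python
from typing import Optional


def mee_from_pot(pot_ev: float) -> Optional[float]:
    """Map the mirrored `pot` value to a mean excitation energy in eV.

    A value of 0.0 is treated as the compute-from-composition sentinel,
    so it yields None instead of a bogus 0 eV.
    """
    if pot_ev <= 0.0:
        return None
    return pot_ev


def comparable_fields(entry: dict) -> dict:
    """Return only the fields that can be compared or written for one entry."""
    fields = {"density": entry["density"]}
    mee = mee_from_pot(entry["pot"])
    if mee is not None:
        fields["mean_excitation_energy_eV"] = mee
    return fields


# Placeholder entries (illustrative numbers only).
with_mee = comparable_fields({"density": 1.0, "pot": 100.0})
density_only = comparable_fields({"density": 2.0, "pot": 0.0})
```

Dropping the key entirely, rather than storing `None`, means a downstream add-only writeback has nothing to write for that field, which matches the "density only" behaviour described for pwo.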
gerchowl added a commit that referenced this pull request on May 6, 2026
…) (#203)

Adds `scripts/enrich_from_nist_webbook.py`, a comparison + add-only writeback enricher for the five fluids that overlap between `src/pymat/data/{gases,liquids}.toml` and the NIST Chemistry WebBook: water (liquid), nitrogen, argon, helium, co2 (gas).

NIST has no JSON API; the script hits the IsoBar TSV endpoint (`fluid.cgi?Action=Data&Type=IsoBar`) at a fixed pressure of 1.01325 bar (1 atm) over a 5 K range straddling the target temperature, then selects the row matching the target. Targets: T=293.15 K for water (liquid phase), T=298.15 K for the gases (25 °C ambient; note this is not IUPAC STP, which is 273.15 K). The (T, P) point and the `Lemmon REFPROP equation of state` provenance are pinned in every `_sources.note`.

Properties enriched (only those already in the schema):

* mechanical.density (kg/m³ → g/cm³)
* thermal.specific_heat (J/(g·K) → J/(kg·K))
* thermal.thermal_conductivity (W/(m·K), identity)

Viscosity is in the WebBook payload but the runtime schema has no viscosity field today, so it is skipped rather than invented. Temperature curves are out of scope (separate Phase-4 follow-up).

Behaviour mirrors the other enrichers (#197/#198/#200/#201):

* Default: comparison-only (DIFF threshold 2%).
* --write: ADD-ONLY; existing values are NEVER overwritten.
* --dry-run / --key / --report flags identical to the siblings.
* Uses `cached_get_text` (30-day TTL) so NIST's polite-traffic expectations are respected; uses `build_source_row` and `writeback` from `scripts/_curation.py`.

`_sources` rows: license = "PD-USGov" (allow-listed since #184), kind = "handbook", citation includes the species, ref pins `webbook.nist.gov:fluid.cgi?ID=<CAS>`. The report footer carries the NIST AS-IS notice. Attribution is not legally required for PD-USGov, and `LICENSES-DATA.md` only enumerates CC-BY/CC-BY-SA sources, so no edit there.

Live run against the 5 fluids: 0 MISSING (every field is already curated), 1 DIFF (helium thermal_conductivity at 2.1% vs NIST 0.15531 W/(m·K)). The other 14 cells are within 2% of NIST. All existing values are preserved.

Tests: `tests/test_nist_webbook_enrichment.py` (11 cases) covers TSV parsing, fixture-based --dry-run (offline via monkeypatched `cached_get_text`), --write add-only, no-overwrite semantics, comment round-trip, --report file output, and the PD-USGov allow-list canary. Fixture `tests/fixtures/nist_webbook_water.txt` holds a trimmed real WebBook response (header + 2 rows) so the test suite runs fully offline.

Closes #159
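The row selection, unit conversions, and 2% DIFF threshold described in this commit can be sketched together. Everything named here is an assumption for illustration: the helper names, the column headers (modelled on WebBook-style TSV headers but not verified against the live endpoint), the nearest-row selection strategy, and the sample TSV, which is synthetic rather than real NIST data.

```python
import csv
import io
from typing import Optional

# Synthetic two-row TSV; headers and numbers are illustrative, not NIST output.
SAMPLE_TSV = (
    "Temperature (K)\tDensity (kg/m3)\tCp (J/g*K)\tTherm. Cond. (W/m*K)\n"
    "290.0\t1000.0\t4.0\t0.5\n"
    "293.15\t998.0\t4.2\t0.6\n"
)


def select_row(tsv_text: str, target_k: float) -> dict:
    """Pick the row whose temperature is closest to the target."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return min(rows, key=lambda r: abs(float(r["Temperature (K)"]) - target_k))


def to_schema_units(row: dict) -> dict:
    """Convert one row into the units named in the commit message."""
    return {
        "density": float(row["Density (kg/m3)"]) / 1000.0,  # kg/m³ -> g/cm³
        "specific_heat": float(row["Cp (J/g*K)"]) * 1000.0,  # J/(g·K) -> J/(kg·K)
        "thermal_conductivity": float(row["Therm. Cond. (W/m*K)"]),  # identity
    }


def classify(ours: Optional[float], reference: float, threshold: float = 0.02) -> str:
    """Label a cell MISSING, DIFF (relative delta above threshold) or OK."""
    if ours is None:
        return "MISSING"
    delta = abs(ours - reference) / abs(reference)
    return "DIFF" if delta > threshold else "OK"


row = select_row(SAMPLE_TSV, 293.15)
values = to_schema_units(row)
```

The relative delta here is taken against the reference value; whether the script divides by the curated value or the NIST value is not stated in the commit, so that choice is also an assumption.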
Summary
* Iterates every material in `src/pymat/data/*.toml` via `_curation.load_material_keys` and resolves the Wikidata QID from `[<material>.sourcing].wikidata` (preferred) or the curator fallback dict; materials without a QID are skipped.
* Adds boiling point (P2102; reported only, since `ThermalProperties` has no `boiling_point` field today), thermal conductivity (P2068, W/(m·K)), and heat capacity (P2056, J/(kg·K) + J/(g·K)). P2153 Young's modulus and P2055 resistivity are deferred: both need per-grade schema resolution that's out of scope; rationale captured in the module docstring.
* `--write` flag implements conservative add-only writeback: when our value is missing and Wikidata has one, write the value plus a paired `_sources` row built via `_curation.build_source_row` (kind=qid, license=CC0, note="Pxxxx via SPARQL <date>"). Existing values are NEVER overwritten; DIFF cases stay advisory and only surface in the report.
* `--report <path>` flag redirects the markdown diff report to a file. `--key` and `--dry-run` are preserved.
* No runtime changes: `requests` + `tomlkit` stay in `scripts/requirements-curation.txt`.

Heat-capacity unit judgement call

Wikidata mixes J/(kg·K) (Q752197), J/(g·K) (Q21075844, multiply by 1000), and J/(mol·K) (Q13035094, molar form). The first two are normalized; molar form is intentionally skipped because converting requires a molar-mass lookup and brings in molecule-vs-formula-unit ambiguity that's out of scope for #158. Unparseable units fall through to `None` and surface as missing in the report.

Test plan

* `uv run pytest tests/test_curation_helpers.py tests/test_wikidata_enrichment.py -v`: 27 passed locally
* `uv run pytest` (full suite): 654 passed, 18 skipped (unrelated headless-render skips)
* `uv run ruff check . && uv run ruff format --check .`: clean
* `python scripts/enrich_from_wikidata.py --key copper --dry-run` produces a sensible markdown report against the bundled fixture

Closes #158
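As a rough illustration of the heat-capacity unit handling described under "Heat-capacity unit judgement call", the normalization can be written as a small lookup on the unit QID. The unit QIDs are the ones quoted in this description; the function name `normalize_heat_capacity`, the constant names, and the handling of full entity URIs are assumptions for the sketch, not the enricher's actual code.

```python
from typing import Optional

# Unit QIDs as quoted in the PR description, mapped to J/(kg·K) factors.
UNIT_FACTORS = {
    "Q752197": 1.0,       # J/(kg·K): already in target units
    "Q21075844": 1000.0,  # J/(g·K): multiply by 1000
}
MOLAR_UNIT = "Q13035094"  # J/(mol·K): intentionally not converted


def normalize_heat_capacity(amount: float, unit: str) -> Optional[float]:
    """Return the value in J/(kg·K), or None for molar or unknown units.

    `unit` may be a bare QID or an entity URI ending in the QID (assumed form).
    """
    qid = unit.rstrip("/").rsplit("/", 1)[-1]
    factor = UNIT_FACTORS.get(qid)
    if factor is None:
        # Covers MOLAR_UNIT and anything unparseable; reported as missing.
        return None
    return amount * factor


# Illustrative amounts only.
per_gram = normalize_heat_capacity(0.5, "Q21075844")
molar = normalize_heat_capacity(25.0, MOLAR_UNIT)
```

Returning `None` for the molar form keeps the decision visible in the report instead of silently converting with a guessed molar mass.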