feat(scripts): shared curation helpers (Phase 4 prep) #197
Merged
Lift the duplicated HTTP fetch / unit normalization / source-row / TOML writeback logic out of `scripts/enrich_from_wikidata.py` into a new `scripts/_curation.py` module so the upcoming per-source enrichers (#158, #159, #164, #165, #166, #167) reuse a single audited implementation. No behavior change to the existing Wikidata enricher; its CLI args, output format, and exit codes are unchanged.

What's in `_curation.py`:

- `cached_get(url, params, *, ttl_days, source)`: disk cache under `scripts/.cache/<source>/<sha256>.json` with TTL refresh
- `UnitNormalizer`: registry mapping `(source, source_unit)` to a canonical `pymat.units.STANDARD_UNITS` string, with optional scale
- `build_source_row(citation, kind, ref, license, note=None)`: validates `kind` and `license` against the allow-list mirrored from `scripts/check_licenses.py`
- `writeback(toml_path, material_path, updates, sources=None)`: comment-preserving TOML round-trip via `tomlkit`
- `load_material_keys(category)`: enumerate dotted material paths in `src/pymat/data/<category>.toml`
- `fmt_delta`: unchanged side-by-side comparison cell formatter

`scripts/enrich_from_wikidata.py` now imports `USER_AGENT` and `fmt_delta` from the shared module and gains a `--dry-run` flag that reads `tests/fixtures/wikidata_sample.json` so the integration is reproducible without network. Live SPARQL behavior verified unchanged.

`tomlkit>=0.12` added to `scripts/requirements-curation.txt`. `scripts/.cache/` added to `.gitignore` (verified gitignored).
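To make the cache layout concrete, here is a minimal sketch of a TTL disk cache keyed by a SHA-256 digest, in the spirit of `cached_get`. It is not the code from `_curation.py`: the names `cache_path` and `cached_fetch`, the injected `fetch` callable, the `cache_root` and `now` parameters, and the `{"fetched_at", "payload"}` file layout are all assumptions chosen so the example runs offline.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path
from typing import Callable, Optional


def cache_path(cache_root: Path, source: str, url: str, params: dict) -> Path:
    """Derive <cache_root>/<source>/<sha256>.json from the request identity."""
    # Sorting keys makes the digest independent of dict insertion order.
    identity = json.dumps({"url": url, "params": params}, sort_keys=True)
    digest = hashlib.sha256(identity.encode("utf-8")).hexdigest()
    return cache_root / source / f"{digest}.json"


def cached_fetch(
    url: str,
    params: dict,
    *,
    ttl_days: float,
    source: str,
    fetch: Callable[[str, dict], dict],  # hypothetical: stands in for the HTTP call
    cache_root: Path,
    now: Optional[float] = None,
) -> dict:
    """Return the cached payload if it is younger than ttl_days, else refetch."""
    now = time.time() if now is None else now
    path = cache_path(cache_root, source, url, params)
    if path.exists():
        entry = json.loads(path.read_text(encoding="utf-8"))
        if now - entry["fetched_at"] < ttl_days * 86400:
            return entry["payload"]  # cache hit: no fetch
    payload = fetch(url, params)  # miss or expired entry: refresh
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps({"fetched_at": now, "payload": payload}), encoding="utf-8"
    )
    return payload


# Usage with a stub fetcher so nothing touches the network.
calls = []


def fake_fetch(url: str, params: dict) -> dict:
    calls.append(url)
    return {"value": len(calls)}


root = Path(tempfile.mkdtemp())
kw = dict(ttl_days=30, source="wikidata", fetch=fake_fetch, cache_root=root)
first = cached_fetch("https://example.org/q", {"id": "Q1"}, now=0.0, **kw)
second = cached_fetch("https://example.org/q", {"id": "Q1"}, now=86400.0, **kw)
third = cached_fetch("https://example.org/q", {"id": "Q1"}, now=31 * 86400.0, **kw)
```

Injecting the clock and the fetcher is a design choice for testability only; it lets the cache-hit and TTL-expiry paths be exercised deterministically.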
The Phase 4 prep tests import `_curation`, which depends on `requests` and `tomlkit` — both curation-time-only deps living in `scripts/requirements-curation.txt`, not the main `[dev]` extras. CI's `uv run pytest` install path doesn't pick those up, so the test module was failing collection on every Python matrix entry. Gate the module with `pytest.importorskip` for both deps. Local runs with curation deps installed continue to exercise all 20 tests; CI now skips the module cleanly until those deps land in the dev extras (which we don't want — keeps the runtime install lean for build123d consumers).
6 tasks
gerchowl added a commit that referenced this pull request on May 6, 2026
…#158) (#198)

* feat(scripts): extend Wikidata enricher to compounds + new properties (#158)

Build on the Phase 4 prep PR (#197) that extracted the shared curation helpers; this is the first Tier-1 Phase 4 implementation.

What changes in `scripts/enrich_from_wikidata.py`:

* **Coverage**: iterates every material in `src/pymat/data/*.toml` via `_curation.load_material_keys`, looks up the QID from `[<material>.sourcing].wikidata` (preferred) or the curator fallback dict, and skips materials without a QID. No new materials introduced; that's Phase 5.
* **New properties**: P2101 melting point and P2054 density (both pre-existing) plus P2102 boiling point, P2068 thermal conductivity, and P2056 heat capacity. P2153 Young's modulus and P2055 resistivity are deferred: both need per-grade resolution the current schema can't express; rationale captured in the module docstring.
* **`--write` flag (NEW)**: conservative add-only writeback. When our value is missing and Wikidata has one, the enricher writes the value plus a paired `_sources` row built via `_curation.build_source_row` (kind=qid, license=CC0, note="Pxxxx via SPARQL <date>"). Existing values are NEVER overwritten; DIFF cases stay advisory and surface in the report only.
* **`--report <path>` flag (NEW)**: redirect the markdown report to a file instead of stdout.

`--key` and `--dry-run` preserved.

Heat-capacity unit normalization handles J/(kg·K) and J/(g·K) (the ×1000 case); J/(mol·K) is intentionally skipped because converting needs a molar-mass lookup that's out of scope for this PR.

Boiling point is fetched + reported but never written: the schema has no `boiling_point` field on `ThermalProperties` today, and adding one would touch runtime code (out of scope).

Tests in `tests/test_wikidata_enrichment.py` cover: `--dry-run` output via the bundled fixture, `--write` add-only path (missing-field + sources row attached), `--write` does NOT overwrite divergent values, comment-preserving round-trip at the grade level, comparison-only never mutates files, and `--report` file output. Gated on `pytest.importorskip("requests"/"tomlkit")` matching the `tests/test_curation_helpers.py` pattern.

No runtime changes: `requests`/`tomlkit` stay in `scripts/requirements-curation.txt`. No new materials added; no loader/properties.py touched.

Closes #158

* fix(scripts): typo Unparseable → Unparsable

Caught by the pre-commit `typos` hook in CI; local install didn't run hooks because they weren't installed in this worktree.
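The heat-capacity handling described above can be sketched as a small `(source, unit)` registry. This is an illustrative stand-in, not the real `UnitNormalizer`: the class name `UnitRegistry`, its `register`/`normalize` methods, the `UnknownUnitError` exception, and the canonical string `"J/(kg·K)"` are assumptions, and the input value is a placeholder rather than a property of any material.

```python
from typing import Dict, Tuple


class UnknownUnitError(KeyError):
    """Raised when a (source, unit) pair has no registered conversion."""


class UnitRegistry:
    """Hypothetical registry: (source, source_unit) -> (canonical_unit, scale)."""

    def __init__(self) -> None:
        self._rules: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def register(
        self, source: str, unit: str, canonical: str, scale: float = 1.0
    ) -> None:
        self._rules[(source, unit)] = (canonical, scale)

    def normalize(self, source: str, unit: str, value: float) -> Tuple[float, str]:
        """Return (value in canonical units, canonical unit string)."""
        try:
            canonical, scale = self._rules[(source, unit)]
        except KeyError:
            # Unknown pairs are rejected instead of being passed through.
            raise UnknownUnitError(f"no rule for {source!r} unit {unit!r}") from None
        return value * scale, canonical


registry = UnitRegistry()
registry.register("wikidata", "J/(kg·K)", "J/(kg·K)")               # identity
registry.register("wikidata", "J/(g·K)", "J/(kg·K)", scale=1000.0)  # the x1000 case
# "J/(mol·K)" is deliberately left unregistered: converting it would need
# a molar mass, so the caller has to skip such values.

per_gram = registry.normalize("wikidata", "J/(g·K)", 0.5)  # placeholder value

try:
    registry.normalize("wikidata", "J/(mol·K)", 25.0)
    molar_rejected = False
except UnknownUnitError:
    molar_rejected = True
```

Rejecting unknown pairs rather than passing values through means a new unit from the source shows up as an explicit skip instead of a silently mis-scaled number.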
5 tasks
gerchowl added a commit that referenced this pull request on May 6, 2026
#167) (#200)

Adds scripts/enrich_from_geant4_nist.py, a sibling to enrich_from_wikidata.py (#198) that cross-checks scintillator and plastic entries against the constants shipped in Geant4 v11.2.0's G4NistMaterialBuilder.cc. Uses the shared curation helpers from #197 (load_material_keys, build_source_row, writeback, fmt_delta).

Coverage:

- density (g/cm³, comparison + add-only writeback)
- mean_excitation_energy_eV (Geant4's mean ionisation potential; schema field added in #157), written into [<material>.nuclear]
- composition: skipped; py-mat has no schema field for element-fraction arrays today; documented in the module docstring

The Geant4 numbers are mirrored by hand into a G4_NIST dict pinned to v11.2.0, with line-number citations into G4NistMaterialBuilder.cc. We do NOT fetch the source at runtime; minor versions are stable across these constants and curation tooling must be reproducible offline.

CONTROVERSIAL: extends the license allow-list with `Geant4-SL` (Geant4 Software License, BSD-like with attribution). Touches:

- scripts/check_licenses.py:ALLOWED
- scripts/_curation.py:LICENSE_ALLOWLIST
- docs/data-policy.md (new row in Allowed-licenses + clarifying paragraph)
- tests/test_check_licenses.py (parametrized accept-list)

Rationale for picking "extend the allow-list" over "use proprietary-reference-only": the Geant4 SL is materially more permissive than the proprietary-reference label implies (full redistribution allowed with attribution), and distinguishing it explicitly sets the right precedent for the BSD-like sources we'll hit next (HEPData mirrors, third-party NIST compilations). If a reviewer prefers Option B (collapse into proprietary-reference-only), the change is a 3-line revert in _curation/check_licenses + a paragraph deletion in data-policy.md. Calling this out so the decision is reviewable rather than buried.

Materials matched (10): bgo, nai, nai.Tl, csi, csi.Tl, csi.Na, plastic_scint, plastic_scint.BC400, plastic_scint.EJ200, pwo (density only; G4_PbWO4 carries pot=0.0, the compute-from-composition sentinel).

Plastics matched (7): pmma (→ G4_LUCITE), pc, ptfe, pe, nylon, delrin, pctfe.

Skipped with logged reason: lyso, lyso.Ce, labr3, peek, ultem, esr, pla, abs, petg, tpu, vespel, torlon (and their grade-level descendants).

Tests: 10 new in tests/test_geant4_enrichment.py covering --dry-run, --write add-only, no-overwrite-existing, comment-preserving round-trip, license-allowlist canary, source-row sanity, and the PWO-MEE-is-None edge case. Existing curation/Wikidata/check-licenses tests: 44 pass unchanged.

Closes #167
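The add-only rule shared by these enrichers can be illustrated on plain dictionaries. This is a sketch under assumptions, not the `writeback` helper: the function name `add_only`, the per-field `_sources` layout, and every value below (including `kind` and the citation strings) are placeholders, and a `None` candidate stands for a reference value that was mapped away, such as the `pot=0.0` sentinel.

```python
from typing import Any, Dict, List, Optional


def add_only(
    table: Dict[str, Any],
    candidates: Dict[str, Optional[float]],
    source_row: Dict[str, str],
) -> List[str]:
    """Fill in missing fields only; return the names of the fields written."""
    written: List[str] = []
    for field, value in candidates.items():
        if value is None:
            continue  # reference has no usable value (e.g. sentinel mapped to None)
        if table.get(field) is not None:
            continue  # an existing value always wins, even when it differs
        table[field] = value
        # Pair each written value with its provenance row (layout is assumed).
        table.setdefault("_sources", {})[field] = dict(source_row)
        written.append(field)
    return written


# Placeholder data: none of these numbers are real material constants.
material: Dict[str, Any] = {"density": 1.0}
row = {
    "citation": "example citation",
    "kind": "handbook",
    "ref": "example-ref",
    "license": "Geant4-SL",
}
written = add_only(
    material,
    {"density": 1.1, "mean_excitation_energy_eV": 50.0, "other": None},
    row,
)
```

Here only `mean_excitation_energy_eV` is written: `density` already has a value, so the differing candidate stays report-only, and `other` is skipped because the reference carries no usable number.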
5 tasks
gerchowl added a commit that referenced this pull request on May 6, 2026
…) (#203)

Adds `scripts/enrich_from_nist_webbook.py`, a comparison + add-only writeback enricher for the five fluids that overlap between `src/pymat/data/{gases,liquids}.toml` and the NIST Chemistry WebBook: water (liquid), nitrogen, argon, helium, co2 (gas).

NIST has no JSON API; the script hits the IsoBar TSV endpoint (`fluid.cgi?Action=Data&Type=IsoBar`) at a fixed pressure of 1.01325 bar (1 atm) over a 5 K range straddling the target temperature, then selects the row matching the target. Targets: T=293.15 K for water (liquid phase), T=298.15 K for the gases (STP). The (T, P) point and the `Lemmon REFPROP equation of state` provenance are pinned in every `_sources.note`.

Properties enriched (only those already in the schema):

* mechanical.density (kg/m³ → g/cm³)
* thermal.specific_heat (J/(g·K) → J/(kg·K))
* thermal.thermal_conductivity (W/(m·K), identity)

Viscosity is in the WebBook payload but the runtime schema has no viscosity field today, so it is skipped rather than invented. Temperature curves are out of scope (separate Phase-4 follow-up).

Behaviour mirrors the other enrichers (#197/#198/#200/#201):

* Default: comparison-only (DIFF threshold 2%).
* --write: ADD-ONLY; existing values are NEVER overwritten.
* --dry-run / --key / --report flags identical to the siblings.
* Uses `cached_get_text` (30-day TTL) so NIST's polite-traffic expectations are respected; uses `build_source_row` and `writeback` from `scripts/_curation.py`.

`_sources` rows: license = "PD-USGov" (allow-listed since #184), kind = "handbook", citation includes the species, ref pins `webbook.nist.gov:fluid.cgi?ID=<CAS>`. The report footer carries the NIST AS-IS notice. Attribution is not legally required for PD-USGov, and `LICENSES-DATA.md` only enumerates CC-BY/CC-BY-SA sources, so no edit there.

Live run against the 5 fluids: 0 MISSING (every field is already curated), 1 DIFF (helium thermal_conductivity at 2.1% vs NIST 0.15531 W/(m·K)). The other 14 cells are within 2% of NIST. All existing values are preserved.

Tests: `tests/test_nist_webbook_enrichment.py` (11 cases) covers TSV parsing, fixture-based --dry-run (offline via monkeypatched `cached_get_text`), --write add-only, no-overwrite semantics, comment round-trip, --report file output, and the PD-USGov allow-list canary. Fixture `tests/fixtures/nist_webbook_water.txt` holds a trimmed real WebBook response (header + 2 rows) so the test suite runs fully offline.

Closes #159
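The parse, select, convert, and compare steps described above can be sketched as follows. Everything here is an assumption for illustration: the function names, the column headers, and the sample table, whose numbers approximate liquid water near 20 °C and are not copied from a WebBook response.

```python
import csv
import io
from typing import Dict, List, Optional, Union

Cell = Union[float, str]


def _num(cell: str) -> Cell:
    """Convert numeric cells to float; leave text cells (e.g. a phase label) alone."""
    try:
        return float(cell)
    except ValueError:
        return cell


def parse_tsv(text: str) -> List[Dict[str, Cell]]:
    """Parse a tab-separated table whose first line is the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [{key: _num(value) for key, value in row.items()} for row in reader]


def row_at(
    rows: List[Dict[str, Cell]], temp_col: str, target: float, tol: float = 1e-6
) -> Optional[Dict[str, Cell]]:
    """Return the row whose temperature matches the target, or None."""
    for row in rows:
        if abs(float(row[temp_col]) - target) <= tol:
            return row
    return None


def classify(ours: Optional[float], reference: float, threshold: float = 0.02) -> str:
    """MISSING if we have no value, DIFF if relative deviation exceeds threshold."""
    if ours is None:
        return "MISSING"
    relative = abs(ours - reference) / abs(reference)
    return "DIFF" if relative > threshold else "OK"


# Synthetic sample table (assumed headers, approximate values).
SAMPLE = (
    "Temperature (K)\tDensity (kg/m3)\tPhase\n"
    "292.15\t998.40\tliquid\n"
    "293.15\t998.21\tliquid\n"
    "294.15\t997.99\tliquid\n"
)

rows = parse_tsv(SAMPLE)
hit = row_at(rows, "Temperature (K)", 293.15)
density_g_cm3 = float(hit["Density (kg/m3)"]) / 1000.0  # kg/m³ -> g/cm³
status_ok = classify(1.0, density_g_cm3)
status_missing = classify(None, density_g_cm3)
status_diff = classify(1.05, 1.0)
```

With a 2% threshold, a curated value of 1.0 g/cm³ against the sample row classifies as OK, while a 5% deviation classifies as DIFF and would stay report-only under the add-only rule.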
7 tasks
Summary

- Lift the HTTP fetch, unit normalization, `_sources` row builder, comment-preserving TOML writeback, and material-key enumeration from `scripts/enrich_from_wikidata.py` into a new `scripts/_curation.py` module so the upcoming per-source enrichers (Integrate Wikidata (CC0) — SPARQL bulk fetch for elements + compounds #158, Integrate NIST Chemistry WebBook (SRD 69) — gases & liquids #159, Integrate refractiveindex.info (CC0) — n,k dispersion library #164, Integrate PubChem (public domain) — pure-compound chemistry scalars #165, Integrate MIL-HDBK-5J (public domain) — aerospace alloy mechanicals #166, Mirror Geant4 G4NistManager constants (BSD-like) — composition baselines #167) share a single audited implementation.
- The Wikidata enricher gains a `--dry-run` flag backed by `tests/fixtures/wikidata_sample.json` for offline reproducibility.
- Add `tomlkit>=0.12` to `scripts/requirements-curation.txt`; gitignore `scripts/.cache/`.

What this is NOT

- A change to `src/pymat/data/*.toml` content.

Test plan

- `uv run pytest tests/test_curation_helpers.py -v`: 20 new tests pass (cache hit, TTL expiry, POST cache, `UnitNormalizer` rejects unknown source/unit, scale applied, `build_source_row` validates `kind`/`license`/`citation`/`ref`, writeback preserves comments, attaches `_sources`, returns False when unchanged, `load_material_keys` for `metals.toml` + synthetic fixture, `--dry-run` consumes the fixture without network)
- `uv run pytest` (full suite): 646 passed, 19 skipped (pre-existing skips), 1 warning (pre-existing)
- `uv run ruff check` and `uv run ruff format`: clean
- `uv run python scripts/enrich_from_wikidata.py --dry-run`: fixture-based output identical in shape to pre-refactor live output
- `uv run python scripts/enrich_from_wikidata.py --key copper`: live SPARQL hit, output unchanged
- `git check-ignore -v scripts/.cache/...`: confirmed gitignored
- `uv run python scripts/check_licenses.py`: passes (no data corpus changes)