
feat(scripts): shared curation helpers (Phase 4 prep)#197

Merged: gerchowl merged 3 commits into main from feature/phase4-prep-curation-helpers on May 6, 2026

@gerchowl gerchowl commented May 6, 2026

Summary

What this is NOT

  • No new abstractions for hypothetical future enrichers (no plugin registry, no event hooks, no protocol classes).
  • No Python floor bump.
  • No changes to src/pymat/data/*.toml content.

Test plan

  • uv run pytest tests/test_curation_helpers.py -v — 20 new tests pass (cache hit, TTL expiry, POST cache, UnitNormalizer rejects unknown source/unit, scale applied, build_source_row validates kind/license/citation/ref, writeback preserves comments, attaches _sources, returns False when unchanged, load_material_keys for metals.toml + synthetic fixture, --dry-run consumes the fixture without network)
  • uv run pytest — full suite: 646 passed, 19 skipped (pre-existing skips), 1 warning (pre-existing)
  • uv run ruff check and uv run ruff format — clean
  • uv run python scripts/enrich_from_wikidata.py --dry-run — fixture-based output identical in shape to pre-refactor live output
  • uv run python scripts/enrich_from_wikidata.py --key copper — live SPARQL hit, output unchanged
  • git check-ignore -v scripts/.cache/... — confirmed gitignored
  • uv run python scripts/check_licenses.py — passes (no data corpus changes)

Lift the duplicated HTTP fetch / unit normalization / source-row /
TOML writeback logic out of `scripts/enrich_from_wikidata.py` into a
new `scripts/_curation.py` module so the upcoming per-source enrichers
(#158, #159, #164, #165, #166, #167) reuse a single audited
implementation. No behavior change to the existing Wikidata enricher;
its CLI args, output format, and exit codes are unchanged.

What's in `_curation.py`:
- `cached_get(url, params, *, ttl_days, source)` — disk cache under
  `scripts/.cache/<source>/<sha256>.json` with TTL refresh
- `UnitNormalizer` — registry mapping `(source, source_unit)` to a
  canonical `pymat.units.STANDARD_UNITS` string, with optional scale
- `build_source_row(citation, kind, ref, license, note=None)` —
  validates `kind` and `license` against the allow-list mirrored from
  `scripts/check_licenses.py`
- `writeback(toml_path, material_path, updates, sources=None)` —
  comment-preserving TOML round-trip via `tomlkit`
- `load_material_keys(category)` — enumerate dotted material paths
  in `src/pymat/data/<category>.toml`
- `fmt_delta` — unchanged side-by-side comparison cell formatter
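
The disk-cache behavior described for `cached_get` can be sketched roughly as below. This is a minimal illustration, not the code in `scripts/_curation.py`: the names `cache_path` and `cached_fetch`, the injected `fetch` callable, the `cache_root` and `now` parameters, and the on-disk entry layout are all assumptions made so the sketch runs without network access.

```python
import hashlib
import json
import time
from pathlib import Path
from urllib.parse import urlencode


def cache_path(cache_root, source, url, params):
    """Return <cache_root>/<source>/<sha256>.json for a request."""
    # Sort params so equivalent requests hash to the same file.
    key = url + "?" + urlencode(sorted((params or {}).items()))
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return Path(cache_root) / source / f"{digest}.json"


def cached_fetch(url, params, *, ttl_days, source, fetch, cache_root, now=None):
    """Return the cached payload if it is younger than ttl_days, else refetch.

    `fetch(url, params)` is a hypothetical injected callable that returns
    JSON-serializable data; the real helper presumably performs the HTTP call.
    """
    now = time.time() if now is None else now
    path = cache_path(cache_root, source, url, params)
    if path.exists():
        entry = json.loads(path.read_text(encoding="utf-8"))
        if now - entry["fetched_at"] < ttl_days * 86400:
            return entry["payload"]  # cache hit, still fresh
    payload = fetch(url, params)  # cache miss or TTL expired: refresh
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps({"fetched_at": now, "payload": payload}), encoding="utf-8"
    )
    return payload
```

Injecting `fetch` and `now` is a choice made for this sketch only; it lets the cache-hit and TTL-expiry paths be exercised deterministically in a test.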

`scripts/enrich_from_wikidata.py` now imports `USER_AGENT` and
`fmt_delta` from the shared module and gains a `--dry-run` flag that
reads `tests/fixtures/wikidata_sample.json` so the integration is
reproducible without network. Live SPARQL behavior verified
unchanged.

`tomlkit>=0.12` added to `scripts/requirements-curation.txt`.
`scripts/.cache/` added to `.gitignore` (verified gitignored).
@gerchowl gerchowl enabled auto-merge (squash) May 6, 2026 21:33
gerchowl and others added 2 commits May 6, 2026 23:35
The Phase 4 prep tests import `_curation`, which depends on
`requests` and `tomlkit` — both curation-time-only deps living in
`scripts/requirements-curation.txt`, not the main `[dev]` extras.
CI's `uv run pytest` install path doesn't pick those up, so the
test module was failing collection on every Python matrix entry.

Gate the module with `pytest.importorskip` for both deps. Local
runs with curation deps installed continue to exercise all 20
tests; CI now skips the module cleanly. Adding those deps to the
dev extras would un-skip it, but we don't want that: leaving them
out keeps the install lean for build123d consumers.
@gerchowl gerchowl merged commit e4e507b into main May 6, 2026
18 checks passed
@vig-os-release-app vig-os-release-app Bot mentioned this pull request May 6, 2026
6 tasks
gerchowl added a commit that referenced this pull request May 6, 2026
…#158) (#198)

* feat(scripts): extend Wikidata enricher to compounds + new properties (#158)

Build on the Phase 4 prep PR (#197) that extracted the shared
curation helpers — this is the first Tier-1 Phase 4 implementation.

What changes in `scripts/enrich_from_wikidata.py`:

* **Coverage**: iterates every material in `src/pymat/data/*.toml`
  via `_curation.load_material_keys`, looks up the QID from
  `[<material>.sourcing].wikidata` (preferred) or the curator
  fallback dict, and skips materials without a QID. No new
  materials introduced — that's Phase 5.

* **New properties**: P2101 melting point and P2054 density (both
  pre-existing) plus P2102 boiling point, P2068 thermal
  conductivity, and P2056 heat capacity. P2153 Young's modulus and
  P2055 resistivity are deferred — both need per-grade resolution
  the current schema can't express; rationale captured in the
  module docstring.

* **`--write` flag (NEW)**: conservative add-only writeback. When
  our value is missing and Wikidata has one, the enricher writes
  the value plus a paired `_sources` row built via
  `_curation.build_source_row` (kind=qid, license=CC0, note=
  "Pxxxx via SPARQL <date>"). Existing values are NEVER
  overwritten — DIFF cases stay advisory and surface in the
  report only.

* **`--report <path>` flag (NEW)**: redirect the markdown report
  to a file instead of stdout. `--key` and `--dry-run` preserved.
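
The add-only rule behind `--write` can be illustrated with plain dictionaries. This is a sketch under assumptions, not the enricher's code: `plan_add_only` and `apply_add_only` are hypothetical names, and the real script goes through `_curation.writeback` and `tomlkit` rather than dicts.

```python
def plan_add_only(current, candidates):
    """Split candidate values into fields to add and fields left alone.

    current:    property -> existing value (None or absent means missing)
    candidates: property -> value reported by the external source
    """
    additions, untouched = {}, {}
    for name, value in candidates.items():
        if current.get(name) is None:
            additions[name] = value  # missing locally: safe to add
        else:
            # Present locally: keep ours, remember the pair for the report.
            untouched[name] = (current[name], value)
    return additions, untouched


def apply_add_only(current, candidates):
    """Return (merged_copy, changed); existing values are never replaced."""
    additions, _ = plan_add_only(current, candidates)
    merged = dict(current)
    merged.update(additions)
    return merged, bool(additions)
```

Returning a `changed` flag follows the "returns False when unchanged" behavior listed in the test plan for `writeback`; whether the enricher itself exposes such a flag is an assumption here.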

Heat-capacity unit normalization handles J/(kg·K) and J/(g·K) (the
×1000 case); J/(mol·K) is intentionally skipped because converting
needs a molar-mass lookup that's out of scope for this PR.
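
A registry keyed on `(source, source_unit)` is one way to express the two handled cases and the deliberate skip. The class below is a hypothetical stand-in for `UnitNormalizer`, whose actual interface is not shown in this thread, and the canonical unit string is a placeholder rather than a value taken from `pymat.units.STANDARD_UNITS`.

```python
class UnitRegistry:
    """Hypothetical stand-in for a (source, unit) -> canonical-unit registry."""

    def __init__(self):
        self._rules = {}

    def register(self, source, source_unit, canonical_unit, scale=1.0):
        self._rules[(source, source_unit)] = (canonical_unit, scale)

    def normalize(self, source, source_unit, value):
        """Return (scaled_value, canonical_unit) or raise for unknown pairs."""
        try:
            canonical_unit, scale = self._rules[(source, source_unit)]
        except KeyError:
            raise ValueError(
                f"no conversion registered for {source!r} unit {source_unit!r}"
            ) from None
        return value * scale, canonical_unit


CANONICAL = "J/(kg*K)"  # placeholder spelling of the canonical unit

registry = UnitRegistry()
registry.register("wikidata", "J/(kg·K)", CANONICAL)          # identity
registry.register("wikidata", "J/(g·K)", CANONICAL, 1000.0)   # the x1000 case
# "J/(mol·K)" is left unregistered on purpose: converting it would need a
# molar mass, so normalize() raises and the caller can skip the value.
```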

Boiling point is fetched + reported but never written — the
schema has no `boiling_point` field on `ThermalProperties` today,
and adding one would touch runtime code (out of scope).

Tests in `tests/test_wikidata_enrichment.py` cover: `--dry-run`
output via the bundled fixture, `--write` add-only path
(missing-field + sources row attached), `--write` does NOT
overwrite divergent values, comment-preserving round-trip at the
grade level, comparison-only never mutates files, and `--report`
file output. Gated on `pytest.importorskip("requests")` and
`pytest.importorskip("tomlkit")`, matching the
`tests/test_curation_helpers.py` pattern.

No runtime changes — `requests`/`tomlkit` stay in
`scripts/requirements-curation.txt`. No new materials added; no
loader/properties.py touched.

Closes #158

* fix(scripts): typo Unparseable → Unparsable

Caught by the pre-commit `typos` hook in CI; local install didn't
run hooks because they weren't installed in this worktree.
gerchowl added a commit that referenced this pull request May 6, 2026
#167) (#200)

Adds scripts/enrich_from_geant4_nist.py — a sibling to enrich_from_wikidata.py
(#198) that cross-checks scintillator and plastic entries against the constants
shipped in Geant4 v11.2.0's G4NistMaterialBuilder.cc. Uses the shared curation
helpers from #197 (load_material_keys, build_source_row, writeback, fmt_delta).

Coverage:
- density (g/cm³, comparison + add-only writeback)
- mean_excitation_energy_eV (Geant4's mean ionisation potential — schema field
  added in #157), written into [<material>.nuclear]
- composition: skipped — py-mat has no schema field for element-fraction
  arrays today; documented in the module docstring

The Geant4 numbers are mirrored by hand into a G4_NIST dict pinned to v11.2.0,
with line-number citations into G4NistMaterialBuilder.cc. We do NOT fetch the
source at runtime — minor versions are stable across these constants and
curation tooling must be reproducible offline.

CONTROVERSIAL: extends the license allow-list with `Geant4-SL` (Geant4
Software License — BSD-like with attribution). Touches:
- scripts/check_licenses.py:ALLOWED
- scripts/_curation.py:LICENSE_ALLOWLIST
- docs/data-policy.md (new row in Allowed-licenses + clarifying paragraph)
- tests/test_check_licenses.py (parametrized accept-list)

Rationale for picking "extend the allow-list" over "use proprietary-reference-
only": the Geant4 SL is materially more permissive than the proprietary-
reference label implies (full redistribution allowed with attribution), and
distinguishing it explicitly sets the right precedent for the BSD-like sources
we'll hit next (HEPData mirrors, third-party NIST compilations). If a reviewer
prefers Option B (collapse into proprietary-reference-only), the change is a
3-line revert in _curation/check_licenses + a paragraph deletion in data-
policy.md. Calling this out so the decision is reviewable rather than buried.

Scintillators matched (10): bgo, nai, nai.Tl, csi, csi.Tl, csi.Na, plastic_scint,
plastic_scint.BC400, plastic_scint.EJ200, pwo (density only — G4_PbWO4 carries
pot=0.0, the compute-from-composition sentinel).

Plastics matched (7): pmma (→ G4_LUCITE), pc, ptfe, pe, nylon, delrin, pctfe.

Skipped with logged reason: lyso, lyso.Ce, labr3, peek, ultem, esr, pla,
abs, petg, tpu, vespel, torlon (and their grade-level descendants).

Tests: 10 new in tests/test_geant4_enrichment.py covering --dry-run,
--write add-only, no-overwrite-existing, comment-preserving round-trip,
license-allowlist canary, source-row sanity, and the PWO-MEE-is-None edge
case. Existing curation/Wikidata/check-licenses tests: 44 pass unchanged.

Closes #167
gerchowl added a commit that referenced this pull request May 6, 2026
…) (#203)

Adds `scripts/enrich_from_nist_webbook.py`, a comparison + add-only
writeback enricher for the five fluids that overlap between
`src/pymat/data/{gases,liquids}.toml` and the NIST Chemistry WebBook:
water (liquid), nitrogen, argon, helium, co2 (gas).

NIST has no JSON API; the script hits the IsoBar TSV endpoint
(`fluid.cgi?Action=Data&Type=IsoBar`) at a fixed pressure of 1.01325
bar (1 atm) over a 5 K range straddling the target temperature, then
selects the row matching the target. Targets: T=293.15 K for water
(liquid phase), T=298.15 K for the gases (25 °C; standard ambient
temperature rather than 0 °C STP). The (T, P) point and
the `Lemmon REFPROP equation of state` provenance are pinned in every
`_sources.note`.
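
Selecting the row that matches the target temperature from a tab-separated response could look like the sketch below. The function name `select_row`, the column headers, the tolerance, and the sample numbers are all assumptions for illustration; the sample values are made up and are not WebBook data.

```python
import csv
import io


def select_row(tsv_text, target_k, temp_column="Temperature (K)", tol_k=0.01):
    """Return the row whose temperature is closest to target_k.

    Raises ValueError if the table is empty or no row lies within tol_k.
    """
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        raise ValueError("response contained no data rows")
    best = min(rows, key=lambda row: abs(float(row[temp_column]) - target_k))
    if abs(float(best[temp_column]) - target_k) > tol_k:
        raise ValueError(f"no row within {tol_k} K of {target_k} K")
    return best


# Made-up numbers for illustration only; these are not WebBook values.
SAMPLE = (
    "Temperature (K)\tDensity (kg/m3)\n"
    "290.65\t1.0\n"
    "293.15\t2.0\n"
    "295.65\t3.0\n"
)
row = select_row(SAMPLE, 293.15)
density_g_cm3 = float(row["Density (kg/m3)"]) / 1000.0  # kg/m³ to g/cm³
```

Failing loudly when no row is close enough is a design choice of this sketch; it avoids silently recording a value at the wrong (T, P) point.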

Properties enriched (only those already in the schema):
  * mechanical.density          (kg/m³ → g/cm³)
  * thermal.specific_heat       (J/(g·K) → J/(kg·K))
  * thermal.thermal_conductivity (W/(m·K), identity)

Viscosity is in the WebBook payload but the runtime schema has no
viscosity field today — skipped rather than invented. Temperature
curves are out of scope (separate Phase-4 follow-up).

Behaviour mirrors the other enrichers (#197/#198/#200/#201):
  * Default: comparison-only (DIFF threshold 2%).
  * --write: ADD-ONLY; existing values are NEVER overwritten.
  * --dry-run / --key / --report flags identical to the siblings.
  * Uses `cached_get_text` (30-day TTL) so NIST's polite-traffic
    expectations are respected; uses `build_source_row` and
    `writeback` from `scripts/_curation.py`.
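
The comparison-only default with its 2% DIFF threshold can be sketched as a small classifier. The function name, the `NO-DATA` and `OK` labels, and the choice to measure the deviation relative to the external value are assumptions; only the `MISSING` and `DIFF` labels and the 2% figure come from the description above.

```python
def classify(ours, theirs, threshold=0.02):
    """Label one comparison cell.

    ours:   locally curated value, or None if the field is absent
    theirs: value from the external source, or None if it has none
    """
    if theirs is None:
        return "NO-DATA"   # hypothetical label: nothing to compare against
    if ours is None:
        return "MISSING"   # candidate for add-only writeback
    if theirs == 0:
        return "OK" if ours == 0 else "DIFF"
    relative = abs(ours - theirs) / abs(theirs)
    return "DIFF" if relative > threshold else "OK"
```

For example, `classify(1.03, 1.0)` yields `"DIFF"` (3% apart) and `classify(1.01, 1.0)` yields `"OK"`; in both cases the local value would be left untouched under `--write`.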

`_sources` rows: license = "PD-USGov" (allow-listed since #184), kind
= "handbook", citation includes the species, ref pins
`webbook.nist.gov:fluid.cgi?ID=<CAS>`. The report footer carries the
NIST AS-IS notice. Attribution is not legally required for PD-USGov,
and `LICENSES-DATA.md` only enumerates CC-BY/CC-BY-SA sources, so no
edit there.
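
A `_sources` row with allow-list validation might be assembled as follows. This is a sketch, not `build_source_row` itself: the function name `sketch_source_row` is hypothetical, and the two sets only contain the kinds and licenses named in this thread, so the real allow-list in `scripts/check_licenses.py` is likely longer.

```python
# Only the values mentioned in this thread; the real allow-lists may differ.
KNOWN_KINDS = {"qid", "handbook"}
KNOWN_LICENSES = {"CC0", "PD-USGov", "Geant4-SL"}


def sketch_source_row(citation, kind, ref, license, note=None):
    """Validate the fields and return a dict suitable for a _sources entry."""
    if not citation or not citation.strip():
        raise ValueError("citation must be a non-empty string")
    if not ref or not ref.strip():
        raise ValueError("ref must be a non-empty string")
    if kind not in KNOWN_KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    if license not in KNOWN_LICENSES:
        raise ValueError(f"license not on the allow-list: {license!r}")
    row = {"citation": citation, "kind": kind, "ref": ref, "license": license}
    if note is not None:
        row["note"] = note  # e.g. the pinned (T, P) point and provenance
    return row


example = sketch_source_row(
    citation="NIST Chemistry WebBook, water",
    kind="handbook",
    ref="webbook.nist.gov:fluid.cgi?ID=<CAS>",  # placeholder kept as written
    license="PD-USGov",
)
```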

Live run against the 5 fluids: 0 MISSING (every field is already
curated), 1 DIFF (helium thermal_conductivity at 2.1% vs NIST
0.15531 W/(m·K)). The other 14 cells are within 2% of NIST. All
existing values are preserved.

Tests: `tests/test_nist_webbook_enrichment.py` (11 cases) covers TSV
parsing, fixture-based --dry-run (offline via monkeypatched
`cached_get_text`), --write add-only, no-overwrite semantics, comment
round-trip, --report file output, and the PD-USGov allow-list
canary. Fixture `tests/fixtures/nist_webbook_water.txt` holds a
trimmed real WebBook response (header + 2 rows) so the test suite
runs fully offline.

Closes #159
@vig-os-release-app vig-os-release-app Bot mentioned this pull request May 7, 2026
7 tasks