
feat(scripts): extend Wikidata enricher to compounds + new properties (#158) #198

Merged
gerchowl merged 2 commits into main from feature/158-wikidata-extend-coverage
May 6, 2026

Conversation

Contributor

@gerchowl gerchowl commented May 6, 2026

Summary

  • Iterates every material in src/pymat/data/*.toml via _curation.load_material_keys and resolves the Wikidata QID from [<material>.sourcing].wikidata (preferred) or the curator fallback dict; materials without a QID are skipped.
  • Adds coverage for boiling point (P2102, report-only since ThermalProperties has no boiling_point field today), thermal conductivity (P2068, W/(m·K)), and heat capacity (P2056, J/(kg·K) + J/(g·K)). P2153 Young's modulus and P2055 resistivity are deferred — both need per-grade schema resolution that's out of scope; rationale captured in the module docstring.
  • New --write flag implements conservative add-only writeback: when our value is missing and Wikidata has one, write the value plus a paired _sources row built via _curation.build_source_row (kind=qid, license=CC0, note="Pxxxx via SPARQL <date>"). Existing values are NEVER overwritten; DIFF cases stay advisory and only surface in the report.
  • New --report <path> flag redirects the markdown diff report to a file. --key and --dry-run are preserved.
  • No runtime changes, no new materials, no loader/properties.py touches; requests + tomlkit stay in scripts/requirements-curation.txt.
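
The add-only rule from the --write bullet can be sketched as below. This is a minimal illustration on plain dicts, not the enricher's actual code: `add_only`, the `_sources` layout, the status strings, and `make_source_row` (a stand-in for `_curation.build_source_row`, whose real signature is not shown in this PR) are all assumptions.

```python
from datetime import date


def make_source_row(pid: str, qid: str) -> dict:
    """Hypothetical stand-in for _curation.build_source_row."""
    return {
        "kind": "qid",
        "ref": qid,
        "license": "CC0",
        "note": f"{pid} via SPARQL {date.today().isoformat()}",
    }


def add_only(section: dict, field: str, wikidata_value, pid: str, qid: str) -> str:
    """Write wikidata_value only when the field is absent; never overwrite."""
    if wikidata_value is None:
        return "NO_DATA"
    if field in section:
        # Existing values always win; a mismatch is only reported.
        return "OK" if section[field] == wikidata_value else "DIFF"
    section[field] = wikidata_value
    section.setdefault("_sources", {})[field] = make_source_row(pid, qid)
    return "ADDED"


# Placeholder numbers and QID, not real material data.
thermal = {"melting_point": 1000.0}
print(add_only(thermal, "thermal_conductivity", 400.0, "P2068", "Q0"))
print(add_only(thermal, "melting_point", 1001.0, "P2101", "Q0"))
print(thermal["melting_point"])
```

The second call reports a difference but leaves `melting_point` untouched, which is the behaviour the "NEVER overwritten" guarantee describes.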

Heat-capacity unit judgement call

Wikidata mixes J/(kg·K) (Q752197), J/(g·K) (Q21075844 — multiply by 1000), and J/(mol·K) (Q13035094 — molar form). The first two are normalized; molar form is intentionally skipped because converting requires a molar-mass lookup and brings in molecule-vs-formula-unit ambiguity that's out of scope for #158. Unparseable units fall through to None and surface as missing in the report.
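
A minimal sketch of that normalization, assuming the enricher receives the numeric amount and the unit's QID separately. The function and constant names are hypothetical; only the three unit QIDs and the ×1000 factor come from the description above.

```python
from typing import Optional

# Wikidata unit QIDs quoted in the PR description.
J_PER_KG_K = "Q752197"
J_PER_G_K = "Q21075844"
J_PER_MOL_K = "Q13035094"  # molar form, intentionally not converted

# Multipliers that bring a value to J/(kg·K).
_FACTORS = {J_PER_KG_K: 1.0, J_PER_G_K: 1000.0}


def normalize_heat_capacity(amount: float, unit_qid: str) -> Optional[float]:
    """Return the value in J/(kg·K), or None for molar or unknown units."""
    factor = _FACTORS.get(unit_qid)
    if factor is None:
        return None
    return amount * factor


print(normalize_heat_capacity(0.5, J_PER_G_K))    # 500.0
print(normalize_heat_capacity(0.5, J_PER_MOL_K))  # None
```

Returning None for both the molar form and unrecognized units matches the "fall through to None and surface as missing" behaviour described above.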

Test plan

  • uv run pytest tests/test_curation_helpers.py tests/test_wikidata_enrichment.py -v — 27 passed locally
  • uv run pytest (full suite) — 654 passed, 18 skipped (unrelated headless-render skips)
  • uv run ruff check . && uv run ruff format --check . — clean
  • python scripts/enrich_from_wikidata.py --key copper --dry-run produces a sensible markdown report against the bundled fixture

Closes #158

…#158)

Build on the Phase 4 prep PR (#197) that extracted the shared
curation helpers — this is the first Tier-1 Phase 4 implementation.

What changes in `scripts/enrich_from_wikidata.py`:

* **Coverage**: iterates every material in `src/pymat/data/*.toml`
  via `_curation.load_material_keys`, looks up the QID from
  `[<material>.sourcing].wikidata` (preferred) or the curator
  fallback dict, and skips materials without a QID. No new
  materials introduced — that's Phase 5.

* **New properties**: P2101 melting point and P2054 density (both
  pre-existing) plus P2102 boiling point, P2068 thermal
  conductivity, and P2056 heat capacity. P2153 Young's modulus and
  P2055 resistivity are deferred — both need per-grade resolution
  the current schema can't express; rationale captured in the
  module docstring.

* **`--write` flag (NEW)**: conservative add-only writeback. When
  our value is missing and Wikidata has one, the enricher writes
  the value plus a paired `_sources` row built via
  `_curation.build_source_row` (kind=qid, license=CC0, note=
  "Pxxxx via SPARQL <date>"). Existing values are NEVER
  overwritten — DIFF cases stay advisory and surface in the
  report only.

* **`--report <path>` flag (NEW)**: redirect the markdown report
  to a file instead of stdout. `--key` and `--dry-run` preserved.
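
The QID lookup order from the Coverage bullet can be sketched like this, assuming the TOML has already been parsed into nested dicts. `resolve_qid`, `FALLBACK_QIDS`, and every QID below are hypothetical placeholders, not values from the repository.

```python
from typing import Optional

# Hypothetical curator fallback dict; the QID is a placeholder.
FALLBACK_QIDS = {"brass": "Q0000002"}


def resolve_qid(key: str, data: dict) -> Optional[str]:
    """Prefer [<material>.sourcing].wikidata, then the fallback dict."""
    qid = data.get(key, {}).get("sourcing", {}).get("wikidata")
    return qid or FALLBACK_QIDS.get(key)


materials = {
    "copper": {"sourcing": {"wikidata": "Q0000001"}},  # placeholder QID
    "brass": {},        # no sourcing table, resolved via the fallback
    "unobtainium": {},  # no QID anywhere, so it is skipped
}
for key in materials:
    qid = resolve_qid(key, materials)
    print(key, qid if qid else "skipped (no QID)")
```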

Heat-capacity unit normalization handles J/(kg·K) and J/(g·K) (the
×1000 case); J/(mol·K) is intentionally skipped because converting
needs a molar-mass lookup that's out of scope for this PR.

Boiling point is fetched + reported but never written — the
schema has no `boiling_point` field on `ThermalProperties` today,
and adding one would touch runtime code (out of scope).

Tests in `tests/test_wikidata_enrichment.py` cover: `--dry-run`
output via the bundled fixture, `--write` add-only path
(missing-field + sources row attached), `--write` does NOT
overwrite divergent values, comment-preserving round-trip at the
grade level, comparison-only never mutates files, and `--report`
file output. Gated on `pytest.importorskip("requests")` and
`pytest.importorskip("tomlkit")`, matching the
`tests/test_curation_helpers.py` pattern.

No runtime changes — `requests`/`tomlkit` stay in
`scripts/requirements-curation.txt`. No new materials added; no
loader/properties.py touched.

Closes #158
@gerchowl gerchowl enabled auto-merge (squash) May 6, 2026 21:46
Caught by the pre-commit `typos` hook in CI; local install didn't
run hooks because they weren't installed in this worktree.
@gerchowl gerchowl merged commit e5692e3 into main May 6, 2026
19 checks passed
@vig-os-release-app vig-os-release-app Bot mentioned this pull request May 6, 2026
6 tasks
gerchowl added a commit that referenced this pull request May 6, 2026
#167) (#200)

Adds scripts/enrich_from_geant4_nist.py — a sibling to enrich_from_wikidata.py
(#198) that cross-checks scintillator and plastic entries against the constants
shipped in Geant4 v11.2.0's G4NistMaterialBuilder.cc. Uses the shared curation
helpers from #197 (load_material_keys, build_source_row, writeback, fmt_delta).

Coverage:
- density (g/cm³, comparison + add-only writeback)
- mean_excitation_energy_eV (Geant4's mean ionisation potential — schema field
  added in #157), written into [<material>.nuclear]
- composition: skipped — py-mat has no schema field for element-fraction
  arrays today; documented in the module docstring

The Geant4 numbers are mirrored by hand into a G4_NIST dict pinned to v11.2.0,
with line-number citations into G4NistMaterialBuilder.cc. We do NOT fetch the
source at runtime — minor versions are stable across these constants and
curation tooling must be reproducible offline.

CONTROVERSIAL: extends the license allow-list with `Geant4-SL` (Geant4
Software License — BSD-like with attribution). Touches:
- scripts/check_licenses.py:ALLOWED
- scripts/_curation.py:LICENSE_ALLOWLIST
- docs/data-policy.md (new row in Allowed-licenses + clarifying paragraph)
- tests/test_check_licenses.py (parametrized accept-list)
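
A rough sketch of what an allow-list check over `_sources` rows could look like. The set below contains only labels named in this thread (the real `LICENSE_ALLOWLIST` likely holds more), and `disallowed_rows` is a hypothetical helper, not the function in `scripts/check_licenses.py`.

```python
# Only labels mentioned in this thread; the real allow-list may be longer.
LICENSE_ALLOWLIST = frozenset({"CC0", "PD-USGov", "Geant4-SL"})


def disallowed_rows(source_rows):
    """Return the rows whose license is missing or not on the allow-list."""
    return [r for r in source_rows if r.get("license") not in LICENSE_ALLOWLIST]


rows = [
    {"field": "density", "license": "Geant4-SL"},
    {"field": "melting_point", "license": "CC0"},
    {"field": "specific_heat", "license": "unknown-license"},
]
print([r["field"] for r in disallowed_rows(rows)])  # ['specific_heat']
```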

Rationale for picking "extend the allow-list" over
"use proprietary-reference-only": the Geant4 SL is materially more permissive
than the proprietary-reference label implies (full redistribution allowed with
attribution), and distinguishing it explicitly sets the right precedent for the
BSD-like sources we'll hit next (HEPData mirrors, third-party NIST
compilations). If a reviewer prefers Option B (collapse into
proprietary-reference-only), the change is a 3-line revert in
_curation/check_licenses + a paragraph deletion in data-policy.md. Calling this
out so the decision is reviewable rather than buried.

Materials matched (10): bgo, nai, nai.Tl, csi, csi.Tl, csi.Na, plastic_scint,
plastic_scint.BC400, plastic_scint.EJ200, pwo (density only — G4_PbWO4 carries
pot=0.0, the compute-from-composition sentinel).
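
The PWO edge case can be illustrated with a small sketch, assuming the hand-mirrored dict stores Geant4's `pot` as a float in eV. The helper name and the non-zero number are hypothetical; only the 0.0 sentinel comes from the text above.

```python
from typing import Optional


def mean_excitation_energy(pot_eV: float) -> Optional[float]:
    """Map the 0.0 'compute from composition' sentinel to None."""
    return None if pot_eV == 0.0 else pot_eV


print(mean_excitation_energy(0.0))    # None: nothing to compare or write
print(mean_excitation_energy(100.0))  # 100.0 (placeholder value)
```

Treating the sentinel as None keeps a literal 0.0 eV from ever being written into `[<material>.nuclear]`.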

Plastics matched (7): pmma (→ G4_LUCITE), pc, ptfe, pe, nylon, delrin, pctfe.

Skipped with logged reason: lyso, lyso.Ce, labr3, peek, ultem, esr, pla,
abs, petg, tpu, vespel, torlon (and their grade-level descendants).

Tests: 10 new in tests/test_geant4_enrichment.py covering --dry-run,
--write add-only, no-overwrite-existing, comment-preserving round-trip,
license-allowlist canary, source-row sanity, and the PWO-MEE-is-None edge
case. Existing curation/Wikidata/check-licenses tests: 44 pass unchanged.

Closes #167
gerchowl added a commit that referenced this pull request May 6, 2026
…) (#203)

Adds `scripts/enrich_from_nist_webbook.py`, a comparison + add-only
writeback enricher for the five fluids that overlap between
`src/pymat/data/{gases,liquids}.toml` and the NIST Chemistry WebBook:
water (liquid), nitrogen, argon, helium, co2 (gas).

NIST has no JSON API; the script hits the IsoBar TSV endpoint
(`fluid.cgi?Action=Data&Type=IsoBar`) at a fixed pressure of 1.01325
bar (1 atm) over a 5 K range straddling the target temperature, then
selects the row matching the target. Targets: T=293.15 K for water
(liquid phase), T=298.15 K for the gases (25 °C; strictly standard ambient
temperature rather than STP). The (T, P) point and
the `Lemmon REFPROP equation of state` provenance are pinned in every
`_sources.note`.
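
The row-selection step might look like the sketch below. The column names, the tolerance, and the sample numbers are assumptions for illustration only; the sample is synthetic and is not a real WebBook response.

```python
import csv
import io

# Synthetic TSV: header names and values are placeholders.
SAMPLE_TSV = (
    "Temperature (K)\tDensity (kg/m3)\tCp (J/g*K)\n"
    "291.15\t100.0\t1.0\n"
    "293.15\t200.0\t2.0\n"
    "295.15\t300.0\t3.0\n"
)


def select_row(tsv_text: str, target_k: float, tol: float = 0.01) -> dict:
    """Return the row whose temperature is closest to target_k, within tol."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        raise ValueError("no data rows in response")
    best = min(rows, key=lambda r: abs(float(r["Temperature (K)"]) - target_k))
    if abs(float(best["Temperature (K)"]) - target_k) > tol:
        raise ValueError(f"no row within {tol} K of {target_k} K")
    return best


row = select_row(SAMPLE_TSV, 293.15)
print(row["Density (kg/m3)"])  # 200.0
```

Raising when no row lies near the target avoids silently comparing against the wrong temperature if the response grid shifts.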

Properties enriched (only those already in the schema):
  * mechanical.density          (kg/m³ → g/cm³)
  * thermal.specific_heat       (J/(g·K) → J/(kg·K))
  * thermal.thermal_conductivity (W/(m·K), identity)
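
The three conversions listed above reduce to simple scale factors. A minimal sketch, with a hypothetical `CONVERTERS` table keyed by the schema paths from the list:

```python
# Converts WebBook units to the schema's units, per the list above.
CONVERTERS = {
    "mechanical.density": lambda kg_m3: kg_m3 / 1000.0,       # kg/m³ to g/cm³
    "thermal.specific_heat": lambda j_gk: j_gk * 1000.0,      # J/(g·K) to J/(kg·K)
    "thermal.thermal_conductivity": lambda w_mk: w_mk,        # W/(m·K), identity
}

print(CONVERTERS["mechanical.density"](1000.0))    # 1.0
print(CONVERTERS["thermal.specific_heat"](4.0))    # 4000.0
```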

Viscosity is in the WebBook payload but the runtime schema has no
viscosity field today — skipped rather than invented. Temperature
curves are out of scope (separate Phase-4 follow-up).

Behaviour mirrors the other enrichers (#197/#198/#200/#201):
  * Default: comparison-only (DIFF threshold 2%).
  * --write: ADD-ONLY; existing values are NEVER overwritten.
  * --dry-run / --key / --report flags identical to the siblings.
  * Uses `cached_get_text` (30-day TTL) so NIST's polite-traffic
    expectations are respected; uses `build_source_row` and
    `writeback` from `scripts/_curation.py`.
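
The comparison-only default can be sketched as a three-way classification. The 2% threshold comes from the list above; taking the deviation relative to the reference value, using a strict `>`, and the status strings are assumptions, and `classify` is a hypothetical name.

```python
from typing import Optional

DIFF_THRESHOLD = 0.02  # 2%, per the commit message


def classify(ours: Optional[float], reference: float) -> str:
    """Label a cell MISSING, DIFF, or OK against the reference value."""
    if ours is None:
        return "MISSING"
    if reference == 0.0:
        # Relative deviation is undefined; fall back to exact comparison.
        return "OK" if ours == 0.0 else "DIFF"
    rel = abs(ours - reference) / abs(reference)
    return "DIFF" if rel > DIFF_THRESHOLD else "OK"


print(classify(None, 1.0))   # MISSING
print(classify(1.01, 1.0))   # OK
print(classify(1.05, 1.0))   # DIFF
```

Only MISSING cells are candidates for `--write`; DIFF cells stay in the report and are never written.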

`_sources` rows: license = "PD-USGov" (allow-listed since #184), kind
= "handbook", citation includes the species, ref pins
`webbook.nist.gov:fluid.cgi?ID=<CAS>`. The report footer carries the
NIST AS-IS notice. Attribution is not legally required for PD-USGov,
and `LICENSES-DATA.md` only enumerates CC-BY/CC-BY-SA sources, so no
edit there.

Live run against the 5 fluids: 0 MISSING (every field is already
curated), 1 DIFF (helium thermal_conductivity at 2.1% vs NIST
0.15531 W/(m·K)). The other 14 cells are within 2% of NIST. All
existing values are preserved.

Tests: `tests/test_nist_webbook_enrichment.py` (11 cases) covers TSV
parsing, fixture-based --dry-run (offline via monkeypatched
`cached_get_text`), --write add-only, no-overwrite semantics, comment
round-trip, --report file output, and the PD-USGov allow-list
canary. Fixture `tests/fixtures/nist_webbook_water.txt` holds a
trimmed real WebBook response (header + 2 rows) so the test suite
runs fully offline.

Closes #159
@vig-os-release-app vig-os-release-app Bot mentioned this pull request May 7, 2026
7 tasks


Development

Successfully merging this pull request may close these issues.

Integrate Wikidata (CC0) — SPARQL bulk fetch for elements + compounds

1 participant