feat(scripts): refractiveindex.info enricher for n,k dispersion (#164) #201
Merged
Adds `scripts/enrich_from_refractiveindex.py`, a Phase-4 curation enricher that pulls n,k dispersion from the Polyanskiy refractiveindex.info database (CC0) and writes the `optical.refractive_index_dispersion` schema field added in #146/#152.

Scope (per #164 Phase-4 design review): six scintillator host entries (`nai`, `nai.Tl`, `csi`, `csi.Tl`, `csi.Na`, `bgo`) and three metals (`aluminum`, `copper`, `gold`). BaF2/YAG/sapphire/fused-silica/PMMA are deferred: those entries either don't exist in mat's TOMLs yet (Phase 5 material adds) or warrant a follow-up enrichment PR.

Implementation details:

* `cached_get_text(url, *, ttl_days, source, suffix)` added to `scripts/_curation.py` as a sibling of the existing JSON-only `cached_get`. Persists payloads as `.txt` / `.yaml` / `.xml` under `scripts/.cache/<source>/`. Exported via `__all__`.
* No git submodule for refractiveindex.info: the database is ~100 MB. Per-material fetches via the `raw.githubusercontent.com` mirror with a 30-day disk cache cover the curation workflow.
* Both `tabulated nk` (metals) and `formula 1` Sellmeier (scintillator hosts) data types are handled. Sellmeier blocks are evaluated on a 50-point log-spaced grid spanning the database's declared `wavelength_range`, so downstream Geant4 / OpticsBuilder consumers get a tabulated proxy without needing an embedded Sellmeier.
* µm → nm conversion (×1000) applied at the parsing boundary.
* ADD-ONLY writeback: existing `optical.refractive_index_dispersion` values are NEVER overwritten; the comparison still runs and logs the wavelength-range overlap so curators can spot mismatches.
* `_sources` rows tagged `optical.refractive_index_dispersion` with `kind = "doi"` when the YAML's REFERENCES block contains a DOI link (the common case), `"handbook"` otherwise. License: `CC0`.
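The Sellmeier handling described above can be sketched roughly as follows. This is an illustration only, not the code in `scripts/enrich_from_refractiveindex.py`: the function names and the `(wavelength_nm, n)` row shape are assumptions. It uses the refractiveindex.info `formula 1` form, n² − 1 = c₀ + Σ Bᵢ λ² / (λ² − Cᵢ²) with λ in µm, and applies the ×1000 µm → nm conversion only when emitting rows.

```python
import math


def log_grid(lo_um, hi_um, points=50):
    """Return `points` log-spaced wavelengths (µm) from lo_um to hi_um inclusive."""
    log_lo, log_hi = math.log(lo_um), math.log(hi_um)
    step = (log_hi - log_lo) / (points - 1)
    return [math.exp(log_lo + i * step) for i in range(points)]


def sellmeier_formula1(coeffs, wavelength_um):
    """Evaluate n from a flat coefficient list [c0, B1, C1, B2, C2, ...].

    n^2 - 1 = c0 + sum_i B_i * lam^2 / (lam^2 - C_i^2), lam in µm.
    """
    lam2 = wavelength_um ** 2
    n2 = 1.0 + coeffs[0]
    for b, c in zip(coeffs[1::2], coeffs[2::2]):
        n2 += b * lam2 / (lam2 - c ** 2)
    return math.sqrt(n2)


def tabulate_nm(coeffs, wavelength_range_um, points=50):
    """Tabulate n on a log grid; wavelengths are converted µm -> nm on output."""
    lo_um, hi_um = wavelength_range_um
    return [
        (lam_um * 1000.0, sellmeier_formula1(coeffs, lam_um))
        for lam_um in log_grid(lo_um, hi_um, points)
    ]
```

With made-up coefficients, `tabulate_nm([0.0, 1.0, 0.1], (0.25, 1.0))` yields 50 rows running from about 250 nm to about 1000 nm. A log grid puts more points at the short-wavelength end, where dispersion typically changes fastest.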
Tests: `tests/test_refractiveindex_enrichment.py` (15 tests) covers the µm→nm conversion, Sellmeier evaluator, DOI extraction, --dry-run report shape, --write add-only behaviour, refusal to overwrite, source-row well-formedness, and the new `cached_get_text` helper. Fixture at `tests/fixtures/refractiveindex_sample.yaml` exercises the parser on a synthetic 4-row YAML.

Curation deps: `pyyaml>=6.0` added to `scripts/requirements-curation.txt` (NOT a runtime dep of mat).

Live verification (2026-05-07): all 9 ENTRY_MAP paths resolve on the upstream `main` branch; comparison report shows `MISSING` (so no collisions with curated values) for every target.

Closes #164.
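The add-only and refusal-to-overwrite behaviour that the tests exercise can be pictured with a small sketch. It is a hypothetical stand-in, not the `writeback` helper in `scripts/_curation.py`: `add_only_set`, `range_overlap_nm`, and the plain-dict entry are assumptions made for illustration.

```python
def add_only_set(entry, dotted_path, value):
    """Set entry[dotted_path] only when it is absent.

    Returns True if the value was added, False if an existing value was kept.
    """
    *parents, leaf = dotted_path.split(".")
    node = entry
    for key in parents:
        node = node.setdefault(key, {})
    if leaf in node:
        return False  # never overwrite a curated value
    node[leaf] = value
    return True


def range_overlap_nm(a, b):
    """Return the overlapping (lo, hi) of two wavelength ranges, or None."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None
```

A second call with a different value leaves the first value in place, and the overlap helper gives the span a curator would compare when both a curated and an upstream table exist.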
gerchowl added a commit that referenced this pull request on May 6, 2026
…) (#203)

Adds `scripts/enrich_from_nist_webbook.py`, a comparison + add-only writeback enricher for the five fluids that overlap between `src/pymat/data/{gases,liquids}.toml` and the NIST Chemistry WebBook: water (liquid), nitrogen, argon, helium, co2 (gas).

NIST has no JSON API; the script hits the IsoBar TSV endpoint (`fluid.cgi?Action=Data&Type=IsoBar`) at a fixed pressure of 1.01325 bar (1 atm) over a 5 K range straddling the target temperature, then selects the row matching the target. Targets: T=293.15 K for water (liquid phase), T=298.15 K for the gases (STP). The (T, P) point and the `Lemmon REFPROP equation of state` provenance are pinned in every `_sources.note`.

Properties enriched (only those already in the schema):

* mechanical.density (kg/m³ → g/cm³)
* thermal.specific_heat (J/(g·K) → J/(kg·K))
* thermal.thermal_conductivity (W/(m·K), identity)

Viscosity is in the WebBook payload but the runtime schema has no viscosity field today, so it is skipped rather than invented. Temperature curves are out of scope (separate Phase-4 follow-up).

Behaviour mirrors the other enrichers (#197/#198/#200/#201):

* Default: comparison-only (DIFF threshold 2%).
* --write: ADD-ONLY; existing values are NEVER overwritten.
* --dry-run / --key / --report flags identical to the siblings.
* Uses `cached_get_text` (30-day TTL) so NIST's polite-traffic expectations are respected; uses `build_source_row` and `writeback` from `scripts/_curation.py`.

`_sources` rows: license = "PD-USGov" (allow-listed since #184), kind = "handbook", citation includes the species, ref pins `webbook.nist.gov:fluid.cgi?ID=<CAS>`. The report footer carries the NIST AS-IS notice. Attribution is not legally required for PD-USGov, and `LICENSES-DATA.md` only enumerates CC-BY/CC-BY-SA sources, so no edit there.

Live run against the 5 fluids: 0 MISSING (every field is already curated), 1 DIFF (helium thermal_conductivity at 2.1% vs NIST 0.15531 W/(m·K)). The other 14 cells are within 2% of NIST. All existing values are preserved.

Tests: `tests/test_nist_webbook_enrichment.py` (11 cases) covers TSV parsing, fixture-based --dry-run (offline via monkeypatched `cached_get_text`), --write add-only, no-overwrite semantics, comment round-trip, --report file output, and the PD-USGov allow-list canary. Fixture `tests/fixtures/nist_webbook_water.txt` holds a trimmed real WebBook response (header + 2 rows) so the test suite runs fully offline.

Closes #159
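The unit conversions and the 2% comparison rule in this commit message can be sketched as below. The function names, the `"OK"` label, and the strict `>` comparison at the threshold are assumptions; only `MISSING`, `DIFF`, the 2% threshold, and the unit pairs come from the message itself.

```python
def density_kg_m3_to_g_cm3(value):
    """kg/m^3 -> g/cm^3 (divide by 1000)."""
    return value / 1000.0


def specific_heat_j_gk_to_j_kgk(value):
    """J/(g*K) -> J/(kg*K) (multiply by 1000)."""
    return value * 1000.0


def classify(curated, reference, threshold=0.02):
    """Compare a curated value against the reference value.

    Returns "MISSING" when nothing is curated, "DIFF" when the relative
    deviation exceeds the threshold, and "OK" otherwise.
    """
    if curated is None:
        return "MISSING"
    relative = abs(curated - reference) / abs(reference)
    return "DIFF" if relative > threshold else "OK"
```

For instance, a reference density of 998.2 kg/m³ (an illustrative round figure for water near room temperature) converts to about 0.9982 g/cm³, and a curated value 2.1% away from its reference is classified as `DIFF`.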
Merged
gerchowl added a commit that referenced this pull request on May 7, 2026
…#134, #135, #136, #137) (#208)

* feat(data): add 4 technical ceramics (AlN, LTCC 951, sapphire, Si3N4)

Adds four primary-source-cited ceramic entries to `src/pymat/data/ceramics.toml`, with `_sources` provenance per property:

- Aluminum nitride (AlN): sintered AlN-170 grade. Vendor scalars from CoorsTek; Slack 1987 (DOI 10.1016/0022-3697(87)90153-3) cited as the intrinsic-thermal-conductivity review (~285 W/(m·K) single-crystal upper bound vs the 170 W/(m·K) commercial scalar). Anisotropic CTE [a,a,c] = [4.2e-6, 4.2e-6, 5.3e-6] / K. Closes #134.
- LTCC DuPont 951: DuPont GreenTape 951 public TDS (MCM951, 7/2011). All scalars verbatim from the "Typical Tape Properties" table. Includes x/y shrinkage 12.7 %, z shrinkage 15 %, dielectric constant at 10 GHz, and surface roughness in a `[ltcc951.custom]` block since the schema has no shrinkage field. License = proprietary-reference-only. Closes #135.
- Sapphire (single-crystal Al2O3): distinct entry from polycrystalline `alumina`. Refractive index n_o = 1.768 at sodium D-line from Malitson & Dodge 1972 (DOI 10.1364/JOSA.62.001405). Mechanical and thermal scalars from Crystran technical brief; hardness/toughness from Dobrovinskaya 2009 textbook (Springer ISBN 978-0-387-85695-7). Anisotropic CTE [a,a,c] = [5.0e-6, 5.0e-6, 5.6e-6] / K. Canonical scalars favor c-axis where the schema accepts a single value; perpendicular-c values documented in source notes. No refractive_index_dispersion populated; that's #201's enricher territory. Closes #136.
- Silicon nitride (Si3N4): sintered SSN/DPSSN grade, the most common engineering Si3N4. All scalars verbatim from the Superior Technical Ceramics 2021 datasheet (ASTM-method-stamped). Riley 2000 (DOI 10.1111/j.1151-2916.2000.tb01182.x) cited as the materials-class review. Closes #137.

Each `_sources` entry uses the standard kind (doi/handbook/vendor) and license (proprietary-reference-only / CC-BY-SA-4.0 for Wikipedia). `note` fields document conditions (room temperature, polycrystalline vs single-crystal, grade designation) and any unit conversions.

Updates `_CATEGORY_BASES["ceramics"]` in `src/pymat/__init__.py` to include the new top-level keys (mirrors PR #206 pattern).

Validation:

- `python -c "import tomllib; tomllib.loads(...)"` parses cleanly
- `python scripts/check_licenses.py` passes (7 TOMLs scanned)
- `pytest`: 627 passed, 25 skipped
- `ruff check . && ruff format --check .`: all clean
- Spot-check: each material's properties load and source_of() returns the cited primary source.

* chore(typos): allowlist DuPont brand name (#135)

`typos` was rewriting "DuPont" → "DuPoint" inside ceramics.toml citation notes for LTCC 951. Adding the brand to the allowlist; same pattern as the LSO and ANID entries above.

* chore(typos): allowlist Pont substring (#135)

`typos` 1.46 matches the substring "Pont" → "Point" inside any word, so allowlisting just "DuPont" doesn't suppress the rewrite; "Pont" itself has to be allowlisted. Verified locally that ceramics.toml no longer trips the hook.
Summary
- `scripts/enrich_from_refractiveindex.py`: a Phase-4 curation enricher that pulls n,k dispersion from the Polyanskiy refractiveindex.info database (CC0) into the `optical.refractive_index_dispersion` schema field (added in #146 "Add SF6 (dielectric gas)" / #152 "Schema: thermal sub-table additions").
- `cached_get_text(url, *, ttl_days, source, suffix)` added to `scripts/_curation.py`: sibling of the existing JSON-only `cached_get`, persists text/YAML/XML payloads under `scripts/.cache/<source>/`. Exported via `__all__`.
- Scope: 6 scintillator host entries (`nai`, `nai.Tl`, `csi`, `csi.Tl`, `csi.Na`, `bgo`) + 3 metals (`aluminum`, `copper`, `gold`). BaF2/YAG/sapphire/fused-silica/PMMA deferred: those entries either aren't in mat's TOMLs yet or warrant a follow-up.
- `tabulated nk` (metals) and `formula 1` Sellmeier (scintillators) data types parsed; Sellmeier evaluated on a 50-point log grid spanning the upstream `wavelength_range`.
- `_sources` rows tagged `optical.refractive_index_dispersion`, `license = CC0`, `kind = doi` when the YAML's REFERENCES block contains a DOI link.
- Per-material fetches via the `raw.githubusercontent.com` mirror with a 30-day disk cache.
- `pyyaml>=6.0` added to `scripts/requirements-curation.txt` (NOT a runtime dep).

Live verification (2026-05-07): all 9 `ENTRY_MAP` paths resolve on upstream `main`; comparison report shows `MISSING` (so no collisions with curated values) for every target.

Test plan
- `pytest tests/test_refractiveindex_enrichment.py tests/test_curation_helpers.py -v`: 35 passed
- `pytest --ignore=tests/test_visual_compare.py` (full suite): 652 passed, 34 skipped
- `ruff check . && ruff format --check .`: clean
- `python scripts/enrich_from_refractiveindex.py --dry-run`: all 9 entries fetched, parsed, no DIFF
- `python scripts/enrich_from_refractiveindex.py --key bgo --write` (then reverted): produces well-formed inline TOML and `optical.refractive_index_dispersion` source row, all comments preserved

Closes #164.
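For readers unfamiliar with the caching pattern behind `cached_get_text`, here is a minimal sketch of a TTL-based text cache. It is not the helper from `scripts/_curation.py`: the `_sketch` name, the SHA-256 file naming, the mtime-based expiry, and the injectable `cache_root` / `fetch` parameters are all assumptions; only the keyword-only `ttl_days`, `source`, `suffix` arguments and the `scripts/.cache/<source>/` layout come from the summary.

```python
import hashlib
import time
from pathlib import Path
from urllib.request import urlopen


def _fetch(url):
    """Download a URL and decode the body as UTF-8 text."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def cached_get_text_sketch(url, *, ttl_days, source, suffix=".txt",
                           cache_root="scripts/.cache", fetch=_fetch):
    """Return the text at `url`, reusing a disk copy younger than ttl_days."""
    cache_dir = Path(cache_root) / source
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: one file per URL, named by the URL's SHA-256 digest.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + suffix
    path = cache_dir / name
    if path.exists():
        age_days = (time.time() - path.stat().st_mtime) / 86400.0
        if age_days < ttl_days:
            return path.read_text(encoding="utf-8")
    text = fetch(url)
    path.write_text(text, encoding="utf-8")
    return text
```

Passing `fetch` as a parameter keeps the sketch testable offline, in the same spirit as the monkeypatched fixtures the test suites use.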