feat(scripts): refractiveindex.info enricher for n,k dispersion (#164) #201

Merged: gerchowl merged 1 commit into main from feature/164-refractiveindex-info on May 6, 2026

@gerchowl gerchowl commented May 6, 2026

Summary

  • Adds scripts/enrich_from_refractiveindex.py, a Phase-4 curation enricher that pulls n,k dispersion from the Polyanskiy refractiveindex.info database (CC0) into the optical.refractive_index_dispersion schema field (added in #146 "Add SF6 (dielectric gas)" and #152 "Schema: thermal sub-table additions").
  • Adds cached_get_text(url, *, ttl_days, source, suffix) to scripts/_curation.py — sibling of the existing JSON-only cached_get, persists text/YAML/XML payloads under scripts/.cache/<source>/. Exported via __all__.
  • Scope per the #164 Phase-4 review: 6 scintillator hosts (nai, nai.Tl, csi, csi.Tl, csi.Na, bgo) + 3 metals (aluminum, copper, gold). BaF2/YAG/sapphire/fused-silica/PMMA are deferred: those entries either aren't in mat's TOMLs yet or warrant a follow-up.
  • Both tabulated nk (metals) and formula 1 Sellmeier (scintillators) data types parsed; Sellmeier evaluated on a 50-point log grid spanning the upstream wavelength_range.
  • ADD-ONLY writeback: existing dispersion values are NEVER overwritten. Source rows tagged optical.refractive_index_dispersion, license = CC0, kind = doi when the YAML's REFERENCES block contains a DOI link.
  • No submodule (database is ~100 MB) — per-material fetches via the raw.githubusercontent.com mirror with a 30-day disk cache.
  • pyyaml>=6.0 added to scripts/requirements-curation.txt (NOT a runtime dep).
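
A minimal sketch of how a TTL-based text cache like `cached_get_text` could work, assuming a file-per-URL layout under `scripts/.cache/<source>/`. The `cache_root` and `fetch` parameters and the hashed file name are illustrative additions, not part of the signature described above, and the real helper in `scripts/_curation.py` may differ.

```python
import hashlib
import time
import urllib.request
from pathlib import Path

# Assumed cache root; the PR only states payloads live under scripts/.cache/<source>/.
CACHE_ROOT = Path("scripts/.cache")


def _fetch(url):
    """Download a URL and decode the body as UTF-8 text."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def cached_get_text(url, *, ttl_days, source, suffix, cache_root=CACHE_ROOT, fetch=_fetch):
    """Return the text at `url`, reusing a disk copy younger than `ttl_days`.

    `cache_root` and `fetch` are hypothetical extras that keep the sketch
    testable offline.
    """
    cache_dir = Path(cache_root) / source
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any URL maps to a filesystem-safe name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + suffix
    path = cache_dir / name
    if path.exists():
        age_days = (time.time() - path.stat().st_mtime) / 86400
        if age_days < ttl_days:
            return path.read_text(encoding="utf-8")
    text = fetch(url)
    path.write_text(text, encoding="utf-8")
    return text
```

With `ttl_days=30`, a second call for the same URL within a month is served from disk, which is what lets the enricher avoid a submodule.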

Live verification (2026-05-07): all 9 ENTRY_MAP paths resolve on upstream main; comparison report shows MISSING (so no collisions with curated values) for every target.

| material | db_path | type | rows | wmin_nm | wmax_nm | k? |
|---|---|---|---|---|---|---|
| nai / nai.Tl | main/NaI/nk/Li | formula 1 | 50 | 250 | 40000 | no |
| csi / csi.Tl / csi.Na | main/CsI/nk/Li | formula 1 | 50 | 250 | 67000 | no |
| bgo | main/Bi4Ge3O12/nk/Williams | formula 1 | 50 | 305 | 1000 | no |
| aluminum | main/Al/nk/Rakic-LD | tabulated nk | 1000 | 62 | 247970 | yes |
| copper | main/Cu/nk/Johnson | tabulated nk | 49 | 188 | 1937 | yes |
| gold | main/Au/nk/Johnson | tabulated nk | 49 | 188 | 1937 | yes |
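
For the `tabulated nk` rows, a parser could look roughly like the sketch below. It assumes the upstream data block is whitespace-separated `wavelength_um n k` lines; the function name and sample values are made up for illustration and are not taken from the script or the database.

```python
def parse_tabulated_nk(data_block):
    """Parse 'wavelength_um n k' lines into (wavelength_nm, n, k) tuples."""
    rows = []
    for line in data_block.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        wl_um, n, k = (float(p) for p in parts)
        rows.append((wl_um * 1000.0, n, k))  # µm -> nm at the parsing boundary
    return rows


# Made-up sample values, not real optical constants.
sample = """
0.2 1.50 2.00
0.5 0.90 1.80
"""
rows = parse_tabulated_nk(sample)
```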

Test plan

  • pytest tests/test_refractiveindex_enrichment.py tests/test_curation_helpers.py -v — 35 passed
  • pytest --ignore=tests/test_visual_compare.py (full suite) — 652 passed, 34 skipped
  • ruff check . && ruff format --check . — clean
  • Live python scripts/enrich_from_refractiveindex.py --dry-run — all 9 entries fetched, parsed, no DIFF
  • Live python scripts/enrich_from_refractiveindex.py --key bgo --write (then reverted) — produces well-formed inline TOML and optical.refractive_index_dispersion source row, all comments preserved

Closes #164.

Adds `scripts/enrich_from_refractiveindex.py`, a Phase-4 curation
enricher that pulls n,k dispersion from the Polyanskiy
refractiveindex.info database (CC0) and writes the
`optical.refractive_index_dispersion` schema field added in #146/#152.

Scope (per #164 Phase-4 design review): six scintillator host entries
(`nai`, `nai.Tl`, `csi`, `csi.Tl`, `csi.Na`, `bgo`) and three metals
(`aluminum`, `copper`, `gold`). BaF2/YAG/sapphire/fused-silica/PMMA
are deferred — those entries either don't exist in mat's TOMLs yet
(Phase 5 material adds) or warrant a follow-up enrichment PR.

Implementation details:

* `cached_get_text(url, *, ttl_days, source, suffix)` added to
  `scripts/_curation.py` as a sibling of the existing JSON-only
  `cached_get`. Persists payloads as `.txt` / `.yaml` / `.xml` under
  `scripts/.cache/<source>/`. Exported via `__all__`.
* No git submodule for refractiveindex.info: the database is ~100 MB.
  Per-material fetches via the `raw.githubusercontent.com` mirror
  with a 30-day disk cache cover the curation workflow.
* Both `tabulated nk` (metals) and `formula 1` Sellmeier (scintillator
  hosts) data types are handled. Sellmeier blocks are evaluated on a
  50-point log-spaced grid spanning the database's declared
  `wavelength_range` so downstream Geant4 / OpticsBuilder consumers
  get a tabulated proxy without needing an embedded Sellmeier.
* µm → nm conversion (×1000) applied at the parsing boundary.
* ADD-ONLY writeback: existing `optical.refractive_index_dispersion`
  values are NEVER overwritten; the comparison still runs and logs the
  wavelength-range overlap so curators can spot mismatches.
* `_sources` rows tagged `optical.refractive_index_dispersion` with
  `kind = "doi"` when the YAML's REFERENCES block contains a DOI link
  (the common case), `"handbook"` otherwise. License: `CC0`.
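
The Sellmeier-to-table step described above can be sketched as follows. This assumes the `formula 1` convention is n² − 1 = c₀ + Σ cᵢ·λ² / (λ² − cᵢ₊₁²) with λ in µm; the function names and the placeholder coefficients are hypothetical, and the script's actual evaluator may be organised differently.

```python
import math


def sellmeier_formula1(wl_um, coeffs):
    """Evaluate n from n^2 - 1 = c[0] + sum_i c[2i+1] * wl^2 / (wl^2 - c[2i+2]^2).

    Wavelength is in micrometres. The coefficient layout is an assumption
    about the refractiveindex.info "formula 1" convention.
    """
    wl2 = wl_um * wl_um
    n2 = 1.0 + coeffs[0]
    for strength, pole_um in zip(coeffs[1::2], coeffs[2::2]):
        n2 += strength * wl2 / (wl2 - pole_um * pole_um)
    return math.sqrt(n2)


def log_grid(wmin_um, wmax_um, points=50):
    """Return `points` wavelengths spaced evenly in log(wavelength)."""
    step = (math.log(wmax_um) - math.log(wmin_um)) / (points - 1)
    return [math.exp(math.log(wmin_um) + i * step) for i in range(points)]


def tabulate_sellmeier(coeffs, wmin_um, wmax_um, points=50):
    """Return (wavelength_nm, n) rows; the µm -> nm conversion happens here."""
    return [
        (wl * 1000.0, sellmeier_formula1(wl, coeffs))
        for wl in log_grid(wmin_um, wmax_um, points)
    ]


# Placeholder coefficients for illustration only (not a real material).
table = tabulate_sellmeier([0.0, 1.0, 0.1], wmin_um=0.25, wmax_um=1.0)
```

A log-spaced grid puts more samples at short wavelengths, where dispersion typically changes fastest, which suits ranges that span several decades.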

Tests: `tests/test_refractiveindex_enrichment.py` (15 tests) covers
the µm→nm conversion, Sellmeier evaluator, DOI extraction, --dry-run
report shape, --write add-only behaviour, refusal to overwrite,
source-row well-formedness, and the new `cached_get_text` helper.
Fixture at `tests/fixtures/refractiveindex_sample.yaml` exercises the
parser on a synthetic 4-row YAML.
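
As an illustration of the DOI-extraction behaviour the tests cover, here is a sketch of how the `kind` could be chosen from a REFERENCES string. The regular expression and function names are assumptions, and the sample DOI is a made-up placeholder.

```python
import re
from typing import Optional

# Assumed pattern: a DOI embedded in a doi.org link inside the references text.
DOI_RE = re.compile(r"doi\.org/(10\.\d{4,9}/[^\s\"<>]+)", re.IGNORECASE)


def extract_doi(references: str) -> Optional[str]:
    """Return the first DOI found in a doi.org link, or None."""
    match = DOI_RE.search(references)
    if match is None:
        return None
    return match.group(1).rstrip(".,;)")  # drop trailing punctuation


def source_kind(references: str) -> str:
    """Pick "doi" when a DOI link is present, "handbook" otherwise."""
    return "doi" if extract_doi(references) else "handbook"


# Placeholder reference text; the DOI is not a real citation.
ref = '<a href="https://doi.org/10.1000/example.123">A. Author, J. Example 1, 2 (2000)</a>'
```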

Curation deps: `pyyaml>=6.0` added to
`scripts/requirements-curation.txt` (NOT a runtime dep of mat).

Live verification (2026-05-07): all 9 ENTRY_MAP paths resolve on the
upstream `main` branch; comparison report shows `MISSING` (so no
collisions with curated values) for every target.
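
The add-only rule and the wavelength-range overlap check can be sketched like this. The flat dict stands in for one material's TOML table, and the `"PRESENT"` label is hypothetical (the text above only names `MISSING` and `DIFF`).

```python
def range_overlap(a, b):
    """Return the overlapping (min, max) of two wavelength ranges, or None."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def add_only(entry, field, new_rows):
    """Return (status, entry) without ever replacing an existing value."""
    current = entry.get(field)
    if not current:
        # Nothing curated yet: safe to add the upstream rows.
        return "MISSING", {**entry, field: new_rows}
    # A curated value exists: keep it untouched and only report.
    return "PRESENT", entry


FIELD = "optical.refractive_index_dispersion"
status_new, added = add_only({}, FIELD, [(305.0, 2.0)])
status_old, kept = add_only({FIELD: [(400.0, 1.5)]}, FIELD, [(305.0, 2.0)])
```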

Closes #164.
@gerchowl gerchowl enabled auto-merge (squash) May 6, 2026 22:17
@gerchowl gerchowl merged commit ee97441 into main May 6, 2026
18 checks passed
@vig-os-release-app vig-os-release-app Bot mentioned this pull request May 6, 2026
6 tasks
gerchowl added a commit that referenced this pull request May 6, 2026
…) (#203)

Adds `scripts/enrich_from_nist_webbook.py`, a comparison + add-only
writeback enricher for the five fluids that overlap between
`src/pymat/data/{gases,liquids}.toml` and the NIST Chemistry WebBook:
water (liquid), nitrogen, argon, helium, co2 (gas).

NIST has no JSON API; the script hits the IsoBar TSV endpoint
(`fluid.cgi?Action=Data&Type=IsoBar`) at a fixed pressure of 1.01325
bar (1 atm) over a 5 K range straddling the target temperature, then
selects the row matching the target. Targets: T=293.15 K for water
(liquid phase), T=298.15 K for the gases (25 °C, commonly called SATP
rather than STP). The (T, P) point and
the `Lemmon REFPROP equation of state` provenance are pinned in every
`_sources.note`.
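
Selecting the target row from the TSV payload might look like the sketch below. The column headers, tolerance, and sample numbers are assumptions for illustration and are not real WebBook output.

```python
import csv
import io


def select_row(tsv_text, target_k, temp_col="Temperature (K)", tol_k=0.01):
    """Return the TSV row whose temperature matches `target_k` within `tol_k`."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        raise ValueError("no data rows in payload")
    best = min(rows, key=lambda r: abs(float(r[temp_col]) - target_k))
    if abs(float(best[temp_col]) - target_k) > tol_k:
        raise ValueError(f"no row within {tol_k} K of {target_k} K")
    return best


# Made-up payload with assumed column headers.
sample = (
    "Temperature (K)\tDensity (kg/m3)\n"
    "292.15\t998.4\n"
    "293.15\t998.2\n"
    "294.15\t998.0\n"
)
row = select_row(sample, 293.15)
```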

Properties enriched (only those already in the schema):
  * mechanical.density          (kg/m³ → g/cm³)
  * thermal.specific_heat       (J/(g·K) → J/(kg·K))
  * thermal.thermal_conductivity (W/(m·K), identity)
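
The three unit conversions above amount to simple scale factors. A sketch, with a hypothetical `CONVERSIONS` mapping keyed by schema field:

```python
# Hypothetical mapping from schema field to a converter (source unit -> schema unit).
CONVERSIONS = {
    "mechanical.density": lambda kg_per_m3: kg_per_m3 / 1000.0,       # kg/m³ -> g/cm³
    "thermal.specific_heat": lambda j_per_g_k: j_per_g_k * 1000.0,    # J/(g·K) -> J/(kg·K)
    "thermal.thermal_conductivity": lambda w_per_m_k: w_per_m_k,      # W/(m·K), identity
}


def convert(field, value):
    """Convert a source value into the unit the schema field expects."""
    return CONVERSIONS[field](value)
```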

Viscosity is in the WebBook payload but the runtime schema has no
viscosity field today — skipped rather than invented. Temperature
curves are out of scope (separate Phase-4 follow-up).

Behaviour mirrors the other enrichers (#197/#198/#200/#201):
  * Default: comparison-only (DIFF threshold 2%).
  * --write: ADD-ONLY; existing values are NEVER overwritten.
  * --dry-run / --key / --report flags identical to the siblings.
  * Uses `cached_get_text` (30-day TTL) so NIST's polite-traffic
    expectations are respected; uses `build_source_row` and
    `writeback` from `scripts/_curation.py`.

`_sources` rows: license = "PD-USGov" (allow-listed since #184), kind
= "handbook", citation includes the species, ref pins
`webbook.nist.gov:fluid.cgi?ID=<CAS>`. The report footer carries the
NIST AS-IS notice. Attribution is not legally required for PD-USGov,
and `LICENSES-DATA.md` only enumerates CC-BY/CC-BY-SA sources, so no
edit there.

Live run against the 5 fluids: 0 MISSING (every field is already
curated), 1 DIFF (helium thermal_conductivity at 2.1% vs NIST
0.15531 W/(m·K)). The other 14 cells are within 2% of NIST. All
existing values are preserved.

Tests: `tests/test_nist_webbook_enrichment.py` (11 cases) covers TSV
parsing, fixture-based --dry-run (offline via monkeypatched
`cached_get_text`), --write add-only, no-overwrite semantics, comment
round-trip, --report file output, and the PD-USGov allow-list
canary. Fixture `tests/fixtures/nist_webbook_water.txt` holds a
trimmed real WebBook response (header + 2 rows) so the test suite
runs fully offline.

Closes #159
gerchowl added a commit that referenced this pull request May 7, 2026
…#134, #135, #136, #137) (#208)

* feat(data): add 4 technical ceramics (AlN, LTCC 951, sapphire, Si3N4)

Adds four primary-source-cited ceramic entries to
`src/pymat/data/ceramics.toml`, with `_sources` provenance per property:

- Aluminum nitride (AlN) — sintered AlN-170 grade. Vendor scalars from
  CoorsTek; Slack 1987 (DOI 10.1016/0022-3697(87)90153-3) cited as the
  intrinsic-thermal-conductivity review (~285 W/(m·K) single-crystal
  upper bound vs the 170 W/(m·K) commercial scalar). Anisotropic CTE
  [a,a,c] = [4.2e-6, 4.2e-6, 5.3e-6] / K. Closes #134.

- LTCC DuPont 951 — DuPont GreenTape 951 public TDS (MCM951, 7/2011).
  All scalars verbatim from the "Typical Tape Properties" table.
  Includes x/y shrinkage 12.7 %, z shrinkage 15 %, dielectric constant
  at 10 GHz, and surface roughness in a `[ltcc951.custom]` block since
  the schema has no shrinkage field. License = proprietary-reference-only.
  Closes #135.

- Sapphire (single-crystal Al2O3) — distinct entry from polycrystalline
  `alumina`. Refractive index n_o = 1.768 at sodium D-line from Malitson
  & Dodge 1972 (DOI 10.1364/JOSA.62.001405). Mechanical and thermal
  scalars from Crystran technical brief; hardness/toughness from
  Dobrovinskaya 2009 textbook (Springer ISBN 978-0-387-85695-7).
  Anisotropic CTE [a,a,c] = [5.0e-6, 5.0e-6, 5.6e-6] / K. Canonical
  scalars favor c-axis where the schema accepts a single value;
  perpendicular-c values documented in source notes. No
  refractive_index_dispersion populated — that's #201's enricher
  territory. Closes #136.

- Silicon nitride (Si3N4) — sintered SSN/DPSSN grade, the most common
  engineering Si3N4. All scalars verbatim from the Superior Technical
  Ceramics 2021 datasheet (ASTM-method-stamped). Riley 2000 (DOI
  10.1111/j.1151-2916.2000.tb01182.x) cited as the materials-class
  review. Closes #137.

Each `_sources` entry uses the standard kind (doi/handbook/vendor) and
license (proprietary-reference-only / CC-BY-SA-4.0 for Wikipedia).
`note` fields document conditions (room temperature, polycrystalline vs
single-crystal, grade designation) and any unit conversions.

Updates `_CATEGORY_BASES["ceramics"]` in `src/pymat/__init__.py` to
include the new top-level keys (mirrors PR #206 pattern).

Validation:
- `python -c "import tomllib; tomllib.loads(...)"` parses cleanly
- `python scripts/check_licenses.py` passes (7 TOMLs scanned)
- `pytest` — 627 passed, 25 skipped
- `ruff check . && ruff format --check .` — all clean
- Spot-check: each material's properties load and source_of() returns
  the cited primary source.

* chore(typos): allowlist DuPont brand name (#135)

`typos` was rewriting "DuPont" → "DuPoint" inside ceramics.toml
citation notes for LTCC 951. Adding the brand to the allowlist;
same pattern as the LSO and ANID entries above.

* chore(typos): allowlist Pont substring (#135)

`typos` 1.46 matches the substring "Pont" → "Point" inside any
word, so allowlisting just "DuPont" doesn't suppress the rewrite —
need to allowlist "Pont" itself. Verified locally that ceramics.toml
no longer trips the hook.
@vig-os-release-app vig-os-release-app Bot mentioned this pull request May 7, 2026
7 tasks

Development

Successfully merging this pull request may close these issues.

Integrate refractiveindex.info (CC0) — n,k dispersion library
