Catalog Engine

A catalog is the machine-readable representation of a compliance framework's controls — NIST SP 800-53 Rev 5, SOC 2 TSC, ISO 27001, the OpenSSF OSPS Baseline, and ~88 others. Catalogs are the "target state" in gap analysis: the gap analyzer compares your evidence against a catalog's controls and reports what's missing. This page explains how catalogs are stored, discovered, loaded, validated, and indexed, and how to regenerate the manifest after adding one.

The code lives in packages/evidentia-core/src/evidentia_core/catalogs/.

The manifest is the single source of truth

Bundled catalogs are enumerated in one file: catalogs/data/frameworks.yaml. As of v0.10.6 it lists 92 frameworks. The manifest's header makes the contract explicit:

# GENERATED BY scripts/catalogs/regenerate_manifest.py — do not hand-edit.
# To add/remove a framework: change the JSON in data/<tier-dir>/ then re-run the
# regeneration script. The manifest reflects what is on disk.

The manifest is loaded and validated by manifest.py into typed Pydantic records:

FrameworkManifestEntry — one framework. Fields: id (canonical kebab-case ID, stable across versions), name, version, tier, category, path (catalog file path relative to data/), source_url, license, license_required, license_url, placeholder, extras (install-extra gating, e.g. "stigs"), and refresh (the CI refresh cadence). The record uses extra="forbid", so a malformed manifest entry fails loudly at load.
FrameworkManifest — the root document: version: int + frameworks: list[FrameworkManifestEntry], plus .get(id), .by_tier(tier), and .by_category(category) helpers.
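The effect of extra="forbid" can be illustrated without Pydantic. The sketch below is a standard-library stand-in, not the real record: EntrySketch, parse_entry, the trimmed field list, and the example ID and path are all hypothetical, chosen only to show how an unknown key fails at load time instead of being silently ignored.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass(frozen=True)
class EntrySketch:
    """Hypothetical, trimmed-down stand-in for FrameworkManifestEntry."""

    id: str
    name: str
    tier: str
    category: str
    path: str
    placeholder: bool = False


def parse_entry(raw: dict[str, Any]) -> EntrySketch:
    """Build an entry, rejecting unknown keys the way extra="forbid" would."""
    allowed = {f.name for f in fields(EntrySketch)}
    unknown = sorted(set(raw) - allowed)
    if unknown:
        raise ValueError(f"unexpected manifest field(s): {unknown}")
    # Missing required fields surface as a TypeError from the constructor.
    return EntrySketch(**raw)


# Illustrative values only; the real manifest IDs and paths may differ.
entry = parse_entry(
    {
        "id": "nist-800-53-rev5",
        "name": "NIST SP 800-53 Rev 5",
        "tier": "A",
        "category": "control",
        "path": "us-federal/nist-800-53-rev5.json",
    }
)
```

A misspelled key such as "licence" would raise here, which is the behavior the manifest relies on to catch malformed entries early.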

load_manifest(path=None) is @cache-decorated: the manifest is immutable at runtime once loaded, and tests that need a fixture pass an explicit path, which becomes the cache key. After parsing, it runs an integrity check that fails loudly on duplicate IDs. Two entries with the same id would silently shadow each other through .get(), so the loader raises ValueError instead.
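A minimal sketch of the two ideas in that paragraph, assuming nothing about the real implementation beyond what is stated: a duplicate-ID check that raises ValueError, and a functools.cache-decorated loader whose argument acts as the cache key. The names find_duplicate_ids, check_unique_ids, and load_ids are hypothetical, and the loader returns fixed data instead of reading a file.

```python
from collections import Counter
from functools import cache
from typing import Optional


def find_duplicate_ids(ids: list[str]) -> list[str]:
    """Return every ID that occurs more than once, sorted for stable messages."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)


def check_unique_ids(ids: list[str]) -> None:
    """Raise instead of letting a later entry shadow an earlier one."""
    dupes = find_duplicate_ids(ids)
    if dupes:
        raise ValueError(f"duplicate framework id(s) in manifest: {dupes}")


@cache
def load_ids(path: Optional[str] = None) -> tuple[str, ...]:
    """Hypothetical cached loader: the `path` argument is the cache key."""
    # A real loader would read and parse the file at `path`;
    # fixed illustrative data keeps this sketch runnable.
    ids = ["nist-800-53-rev5", "soc2-tsc"]
    check_unique_ids(ids)
    return tuple(ids)
```

Because the cache is keyed on the argument, a call with no path and a call with an explicit fixture path are cached independently, so a test fixture never pollutes the default entry.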

Redistribution tiers and categories

Two Literal types in manifest.py carry legal + structural metadata that the rest of the engine reads:

Tier = "A" | "B" | "C" | "D". A = verbatim redistribution OK (US federal works, CC-BY, public domain); B = free with conditions (MITRE ATT&CK, CISA KEV); C = copyrighted, stub-only (ISO, SOC 2 TSC, PCI DSS, HITRUST, CIS — Evidentia ships a placeholder and the operator supplies their licensed copy); D = government regulation text, bundlable with attribution (GDPR, EU AI Act, state privacy laws). The full per-framework analysis is in ATTRIBUTION.md.
Category = "control" | "technique" | "vulnerability" | "obligation". Most catalogs are control; the others cover ATT&CK/CWE techniques, KEV-style vulnerability lists, and privacy-law obligations, each of which loads into a different Pydantic model.
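In Python, the two types above would typically be spelled with typing.Literal. The following is a sketch under that assumption; TIER_SUMMARY and is_stub_only are hypothetical helpers that only restate the tier descriptions, and are not part of manifest.py as far as this page says.

```python
from typing import Literal, get_args

Tier = Literal["A", "B", "C", "D"]
Category = Literal["control", "technique", "vulnerability", "obligation"]

# Short summaries paraphrased from the tier descriptions above.
TIER_SUMMARY: dict[str, str] = {
    "A": "verbatim redistribution OK",
    "B": "free with conditions",
    "C": "copyrighted, stub-only",
    "D": "government regulation text, bundlable with attribution",
}


def is_stub_only(tier: str) -> bool:
    """True for the tier where only a placeholder ships (Tier C)."""
    if tier not in get_args(Tier):
        raise ValueError(f"unknown tier {tier!r}; expected one of {get_args(Tier)}")
    return tier == "C"
```

get_args recovers the allowed values from the Literal at runtime, so the validation and the type annotation cannot drift apart.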

The `_load_catalog_data` choke point

Catalogs ship as either JSON (NIST OSCAL exports, and the original Evidentia format) or, since v0.10.3, YAML (friendlier for hand-authoring — comments, multi-line strings, no comma/escape headaches). Both formats produce the same dict shape, so a single helper in loader.py dispatches on file extension:

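import json
from pathlib import Path
from typing import Any

import yaml  # PyYAML


def _load_catalog_data(catalog_path: Path) -> dict[str, Any]:
    # Dispatch on the lowercased extension so ".YAML" and ".yaml" behave alike.
    suffix = catalog_path.suffix.lower()
    text = catalog_path.read_text(encoding="utf-8")
    if suffix in (".yaml", ".yml"):
        data = yaml.safe_load(text)
    elif suffix == ".json":
        data = json.loads(text)
    else:
        # ... explicit, self-resolving error for missing/unknown extension ...
        # (exception type assumed here; the real message is elided)
        raise ValueError(f"unsupported catalog extension {suffix!r} ...")
    # Reject list or scalar roots before downstream code assumes a dict.
    if not isinstance(data, dict):
        raise ValueError(f"{suffix} catalog {catalog_path.name} top-level must be a mapping ...")
    return data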

Two design rules make this a choke point rather than just a convenience:

Centralized extension dispatch. Every catalog file read in the loader module goes through _load_catalog_data. The module docstring states the invariant directly: never add a sibling json.loads / yaml.safe_load call elsewhere, because this helper is where extension dispatch and non-mapping-root rejection live. New loaders accept a Path and call this first.
Non-mapping-root rejection. A YAML or JSON file whose top level is a list or scalar is rejected at the choke point with a clear message, before any downstream code assumes a dict.

The error path for a missing extension is deliberately self-resolving (v0.10.4 polish): the error for a file with no suffix states exactly how to fix it (mv catalog catalog.yaml), because the empty suffix ('') is otherwise opaque to an operator who dragged and dropped a file.
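What a self-resolving message might look like can be sketched as follows. The function name, the SUPPORTED_SUFFIXES constant, and the exact wording are assumptions; only the idea of embedding the mv command for a suffix-less file comes from the text above.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = (".json", ".yaml", ".yml")


def unsupported_suffix_message(catalog_path: Path) -> str:
    """Hypothetical message builder; the real wording in loader.py may differ."""
    suffix = catalog_path.suffix.lower()
    name = catalog_path.name
    if suffix == "":
        # No extension at all: tell the operator exactly what to run.
        return (
            f"catalog file {name!r} has no extension; "
            f"rename it, e.g. `mv {name} {name}.yaml`"
        )
    # Some other extension: list the ones the loader understands.
    return (
        f"unsupported catalog extension {suffix!r} for {name!r}; "
        f"expected one of {', '.join(SUPPORTED_SUFFIXES)}"
    )
```

Calling it with Path("catalog") yields a message containing the literal command mv catalog catalog.yaml, which an operator can copy and run.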

Loading into typed models

_load_catalog_data returns a plain dict; the typed loaders turn it into a model. The dispatch in load_catalog / load_any_catalog:

A dict with a top-level "catalog" key → load_oscal_catalog (an OSCAL Catalog JSON: groups → controls → enhancements). This walks the OSCAL structure, extracting the statement prose, assessment objectives, priority, baseline impact, related controls, and parameters from each control's parts/props/links/params.
Otherwise the dict's category field decides: "control" → load_evidentia_catalog (the simplified Evidentia format used for frameworks without published OSCAL, like SOC 2 / ISO / CIS / CMMC / PCI DSS); "technique" / "vulnerability" / "obligation" → load_non_control_catalog, which returns TechniqueCatalog, VulnerabilityCatalog, or ObligationCatalog respectively.
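The two dispatch rules above can be sketched as a small decision function. This returns the loader's name as a string so it stays runnable without the real package; pick_loader is hypothetical, and raising on a missing or unknown category is an assumption, since the page does not say what the real code does in that case.

```python
from typing import Any

NON_CONTROL_CATEGORIES = {"technique", "vulnerability", "obligation"}


def pick_loader(data: dict[str, Any]) -> str:
    """Return the name of the loader the described dispatch would choose."""
    # Rule 1: a top-level "catalog" key marks an OSCAL Catalog document.
    if "catalog" in data:
        return "load_oscal_catalog"
    # Rule 2: otherwise the category field decides.
    category = data.get("category")
    if category == "control":
        return "load_evidentia_catalog"
    if category in NON_CONTROL_CATEGORIES:
        return "load_non_control_catalog"
    # Assumption: unknown or missing categories are treated as an error here.
    raise ValueError(f"unrecognized catalog category: {category!r}")
```

The OSCAL check comes first, so an OSCAL document is routed correctly even if it also happens to carry a category field.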

load_catalog raises if asked for a control catalog when the on-disk data is a non-control category (use load_any_catalog when you don't know the shape). When no explicit path is given, the path is resolved through resolve_catalog_path, which lets a user-imported catalog shadow a bundled stub of the same framework_id — an organization's licensed ISO 27001 copy wins over the Tier-C placeholder, and the shadow is logged.
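The shadowing behavior can be illustrated with a simplified resolver. This is a sketch only: resolve_path_sketch, the two-directory layout, and the <framework_id>.json naming are hypothetical, and the real resolve_catalog_path may locate files quite differently (for example through the manifest's path field).

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger("catalog_sketch")


def resolve_path_sketch(
    framework_id: str, user_dir: Path, bundled_dir: Path
) -> Optional[Path]:
    """Prefer a user-imported catalog over a bundled one with the same ID."""
    user = user_dir / f"{framework_id}.json"
    bundled = bundled_dir / f"{framework_id}.json"
    if user.is_file():
        if bundled.is_file():
            # Make the override visible instead of silent.
            logger.info("user catalog %s shadows bundled %s", user, bundled)
        return user
    return bundled if bundled.is_file() else None
```

With a licensed ISO 27001 copy in the user directory and a placeholder in the bundled one, the function returns the user copy and logs the shadow.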

The control model and its index

ControlCatalog (models/catalog.py) holds the parsed framework: framework_id, framework_name, version, source, controls: list[CatalogControl], families, optional family_hierarchy, the category literal, and tier/license/placeholder metadata.

Each CatalogControl carries id, title, description, family, OSCAL-derived fields (objective, guidance, priority, baseline_impact, related_controls, assessment_objectives, parameters), nested enhancements: list[CatalogControl], and per-control tier/license metadata. AI-governance catalogs additionally use risk_tier and applies_to_annex_iii (v0.9.3).

The clever part is the lookup index, built in model_post_init. NIST renders enhancement IDs two ways (AC-2(1)(a) in publications, ac-2.1.a in OSCAL content), and both are valid input. The catalog normalizes every ID to a dotted, uppercase canonical form (_normalize_control_id, backed by the _PAREN_TO_DOT regex), then walks the full enhancement tree to build a flat {normalized_id: CatalogControl} index. As a result, get_control("AC-2(1)") and get_control("ac-2.1") resolve to the same control, ignoring differences in case and whitespace. control_count is the size of that index, so it counts enhancements as well as base controls.

The normalizing regex is bounded ([^()]{1,16}) specifically so it cannot backtrack quadratically on pathological input like a long run of open-parens — the bound comfortably exceeds any real control ID.
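A minimal sketch of the normalization and the flat index, assuming a pattern built around the bounded class quoted above. The actual _PAREN_TO_DOT pattern is not shown on this page, so the regex here is a guess at its shape; ControlSketch, normalize_control_id, and build_index are hypothetical names.

```python
import re
from dataclasses import dataclass, field

# Bounded character class: each parenthesised segment is at most 16 characters.
_PAREN_SEGMENT = re.compile(r"\(([^()]{1,16})\)")


def normalize_control_id(raw: str) -> str:
    """Canonicalize to dotted uppercase so both NIST renderings collide."""
    compact = "".join(raw.split())  # drop all whitespace
    dotted = _PAREN_SEGMENT.sub(r".\1", compact)  # "(1)" becomes ".1"
    return dotted.upper()


@dataclass
class ControlSketch:
    """Hypothetical stand-in for CatalogControl with only the fields needed."""

    id: str
    title: str = ""
    enhancements: list["ControlSketch"] = field(default_factory=list)


def build_index(controls: list[ControlSketch]) -> dict[str, ControlSketch]:
    """Walk the enhancement tree iteratively into a flat lookup table."""
    index: dict[str, ControlSketch] = {}
    stack = list(controls)
    while stack:
        control = stack.pop()
        index[normalize_control_id(control.id)] = control
        stack.extend(control.enhancements)
    return index
```

Lookups then normalize the query with the same function before indexing, which is what makes AC-2(1) and ac-2.1 land on the same entry.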

The registry

FrameworkRegistry (registry.py) is the process-wide singleton that ties it together. It loads the manifest at construction, lazily loads + caches catalogs on first access (get_catalog, get_control), and exposes the crosswalk engine (also lazy) via its crosswalk property. list_frameworks(tier=None, category=None) returns manifest entries in declaration order, optionally filtered. For backward compatibility with v0.1.x callers, a module-level FRAMEWORK_METADATA dict is computed once from the manifest at import time, but new code should prefer load_manifest() or the registry's typed manifest property.
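The lazy load-and-cache behavior can be sketched in a few lines. LazyCatalogCache is a hypothetical illustration of the pattern, not the FrameworkRegistry class itself: it takes any loader callable and invokes it at most once per framework ID.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class LazyCatalogCache(Generic[T]):
    """Load each catalog on first access, then serve it from memory."""

    def __init__(self, loader: Callable[[str], T]) -> None:
        self._loader = loader
        self._loaded: dict[str, T] = {}

    def get_catalog(self, framework_id: str) -> T:
        # Nothing is parsed at construction; the first lookup pays the cost.
        if framework_id not in self._loaded:
            self._loaded[framework_id] = self._loader(framework_id)
        return self._loaded[framework_id]
```

With 92 bundled frameworks, deferring the parse this way means a process that touches one catalog never pays for the other 91.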

Adding a framework + regenerating the manifest

The workflow is "drop a file, run the script":

Drop a catalog JSON (or YAML) into the appropriate tier directory under catalogs/data/ (us-federal/, international/, state-privacy/, threats/, stubs/).
Run uv run python scripts/catalogs/regenerate_manifest.py. The script scans each tier directory, accepts .json / .yaml / .yml catalog files, infers a best-guess refresh cadence from tier + category, and rewrites frameworks.yaml so the manifest is truthful by construction: the contents of data/<tier-dir>/ are the manifest, with no hand-maintained sync.

One important filter (added v0.10.6 P1): the regenerator skips OSCAL-Catalog sidecar artifacts matching *.oscal.json / *.oscal.yaml. These are downstream-consumption serializations (for example osps-baseline.oscal.json, the OSCAL Catalog 1.2.1 form that ships alongside the OSPS Baseline YAML catalogs) and must not register as their own manifest entries — otherwise the same framework would appear twice. This is why three OSPS YAML catalogs raised the count to 92 while their companion .oscal.json did not.
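The file filter described above can be sketched as a single predicate. The name is_manifest_candidate and the constants are hypothetical; the accepted extensions and the two sidecar patterns are taken from the text, and the example file name osps-baseline.yaml in the usage note is illustrative.

```python
from pathlib import PurePath

CATALOG_SUFFIXES = {".json", ".yaml", ".yml"}
# Sidecar patterns named in the text: *.oscal.json and *.oscal.yaml.
SIDECAR_ENDINGS = (".oscal.json", ".oscal.yaml")


def is_manifest_candidate(path: PurePath) -> bool:
    """True if the regenerator should register this file as a framework."""
    name = path.name.lower()
    if name.endswith(SIDECAR_ENDINGS):
        # OSCAL sidecars duplicate a catalog that is already registered.
        return False
    return path.suffix.lower() in CATALOG_SUFFIXES
```

Under these assumptions osps-baseline.yaml is accepted while osps-baseline.oscal.json is skipped, so the framework appears in the manifest exactly once.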
