Catalog Engine
A catalog is the machine-readable representation of a compliance framework's controls — NIST SP 800-53 Rev 5, SOC 2 TSC, ISO 27001, the OpenSSF OSPS Baseline, and ~88 others. Catalogs are the "target state" in gap analysis: the gap analyzer compares your evidence against a catalog's controls and reports what's missing. This page explains how catalogs are stored, discovered, loaded, validated, and indexed, and how to regenerate the manifest after adding one.
The code lives in packages/evidentia-core/src/evidentia_core/catalogs/.
Bundled catalogs are enumerated in one file: catalogs/data/frameworks.yaml. As of v0.10.6 it lists 92 frameworks. The manifest's header makes the contract explicit:
    # GENERATED BY scripts/catalogs/regenerate_manifest.py — do not hand-edit.
    # To add/remove a framework: change the JSON in data/<tier-dir>/ then re-run the
    # regeneration script. The manifest reflects what is on disk.
The manifest is loaded and validated by manifest.py into typed Pydantic records:
- FrameworkManifestEntry: one framework. Fields: id (canonical kebab-case ID, stable across versions), name, version, tier, category, path (catalog file path relative to data/), source_url, license, license_required, license_url, placeholder, extras (install-extra gating, e.g. "stigs"), and refresh (the CI refresh cadence). The record uses extra="forbid", so a malformed manifest entry fails loudly at load.
- FrameworkManifest: the root document, with version: int and frameworks: list[FrameworkManifestEntry], plus .get(id), .by_tier(tier), and .by_category(category) helpers.
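As a rough illustration of the two records and their helpers, the sketch below uses plain dataclasses as a stand-in for the Pydantic models. The class names Entry and Manifest, the three-field subset, the sample framework IDs, and the choice to return None from get() for an unknown ID are all assumptions made for the example, not the project's actual code.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    """Stand-in for FrameworkManifestEntry with only three of its fields."""
    id: str
    tier: str
    category: str


@dataclass
class Manifest:
    """Stand-in for FrameworkManifest: a version plus an ordered entry list."""
    version: int
    frameworks: list[Entry] = field(default_factory=list)

    def get(self, framework_id: str) -> Entry | None:
        # Assumption: unknown IDs return None; the real helper may differ.
        return next((e for e in self.frameworks if e.id == framework_id), None)

    def by_tier(self, tier: str) -> list[Entry]:
        return [e for e in self.frameworks if e.tier == tier]

    def by_category(self, category: str) -> list[Entry]:
        return [e for e in self.frameworks if e.category == category]


# Hypothetical sample entries, for illustration only.
manifest = Manifest(version=1, frameworks=[
    Entry(id="nist-800-53-rev5", tier="A", category="control"),
    Entry(id="iso-27001", tier="C", category="control"),
    Entry(id="mitre-attack", tier="B", category="technique"),
])
```

A dataclass constructor raises TypeError on an unexpected keyword argument, which is loosely analogous to what extra="forbid" does for a Pydantic record: a misspelled or unknown field stops the load instead of being silently dropped.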
load_manifest(path=None) is @cache-decorated: the manifest is immutable at runtime once loaded, and tests that need a fixture pass an explicit path, which keys the cache. After parsing, it runs an integrity check that fails loudly on duplicate IDs. Two entries with the same id would silently shadow each other through .get(), so the loader raises ValueError instead.
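The caching and duplicate-ID behavior can be sketched roughly as follows. This is a minimal illustration, not the project's loader: it reads JSON rather than YAML to stay within the standard library, and the names check_unique_ids, load_manifest_sketch, and DEFAULT_MANIFEST are hypothetical.

```python
from __future__ import annotations

import json
from collections import Counter
from functools import cache
from pathlib import Path

# Hypothetical default location; the real loader resolves its own path.
DEFAULT_MANIFEST = Path("frameworks.json")


def check_unique_ids(entries: list[dict]) -> None:
    """Raise ValueError if two manifest entries share an id."""
    counts = Counter(entry["id"] for entry in entries)
    duplicates = sorted(fid for fid, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate framework ids in manifest: {duplicates}")


@cache
def load_manifest_sketch(path: Path | None = None) -> tuple[str, ...]:
    """Parse once per distinct path; the argument is the cache key."""
    source = path if path is not None else DEFAULT_MANIFEST
    data = json.loads(source.read_text(encoding="utf-8"))
    check_unique_ids(data["frameworks"])
    # Return an immutable tuple so cached results cannot be mutated by callers.
    return tuple(entry["id"] for entry in data["frameworks"])
```

Because functools.cache keys on the call arguments, a test that passes its own fixture path gets a separate cache slot and never sees the bundled manifest's cached result.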
Two Literal types in manifest.py carry legal + structural metadata that the rest of the engine reads:
- Tier = "A" | "B" | "C" | "D". A = verbatim redistribution OK (US federal works, CC-BY, public domain); B = free with conditions (MITRE ATT&CK, CISA KEV); C = copyrighted, stub-only (ISO, SOC 2 TSC, PCI DSS, HITRUST, CIS; Evidentia ships a placeholder and the operator supplies their licensed copy); D = government regulation text, bundlable with attribution (GDPR, EU AI Act, state privacy laws). The full per-framework analysis is in ATTRIBUTION.md.
- Category = "control" | "technique" | "vulnerability" | "obligation". Most catalogs are control; the others cover ATT&CK/CWE techniques, KEV-style vulnerability lists, and privacy-law obligations, each of which loads into a different Pydantic model.
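For readers unfamiliar with Literal aliases, here is a small sketch of how the two types might be declared and checked at runtime. The TIER_SUMMARY table paraphrases the descriptions above, and is_stub_only is a hypothetical helper, not part of manifest.py.

```python
from typing import Literal, get_args

Tier = Literal["A", "B", "C", "D"]
Category = Literal["control", "technique", "vulnerability", "obligation"]

# Paraphrased from the tier descriptions; illustrative only.
TIER_SUMMARY = {
    "A": "verbatim redistribution OK",
    "B": "free with conditions",
    "C": "copyrighted, stub-only",
    "D": "government regulation text, bundlable with attribution",
}


def is_stub_only(tier: str) -> bool:
    """Hypothetical helper: True when only a placeholder can be shipped."""
    if tier not in get_args(Tier):
        raise ValueError(f"unknown tier {tier!r}; expected one of {get_args(Tier)}")
    return tier == "C"
```

get_args recovers the allowed values from the alias, so the runtime check and the static type stay in sync from a single declaration.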
Catalogs ship as either JSON (NIST OSCAL exports, and the original Evidentia format) or, since v0.10.3, YAML (friendlier for hand-authoring — comments, multi-line strings, no comma/escape headaches). Both formats produce the same dict shape, so a single helper in loader.py dispatches on file extension:
    import json
    from pathlib import Path
    from typing import Any

    import yaml


    def _load_catalog_data(catalog_path: Path) -> dict[str, Any]:
        suffix = catalog_path.suffix.lower()
        text = catalog_path.read_text(encoding="utf-8")
        # Dispatch on file extension; YAML and JSON parse to the same dict shape.
        if suffix in (".yaml", ".yml"):
            data = yaml.safe_load(text)
        elif suffix == ".json":
            data = json.loads(text)
        else:
            # ... explicit, self-resolving error for missing/unknown extension ...
        # Reject list or scalar roots before downstream code assumes a mapping.
        if not isinstance(data, dict):
            raise ValueError(f"{suffix} catalog {catalog_path.name} top-level must be a mapping ...")
        return data

Two design rules make this a choke point rather than just a convenience:
- Centralized extension dispatch. Every catalog file read in the loader module goes through _load_catalog_data. The module docstring states the invariant directly: never add a sibling json.loads / yaml.safe_load call elsewhere, because this helper is where extension dispatch and non-mapping-root rejection live. New loaders accept a Path and call this first.
- Non-mapping-root rejection. A YAML or JSON file whose top level is a list or scalar is rejected at the choke point with a clear message, before any downstream code assumes a dict.
The error path for a missing extension is deliberately self-resolving (v0.10.4 polish): a file with no suffix gets told exactly how to fix it (mv catalog catalog.yaml), because the bare-suffix default ('') is otherwise opaque to an operator who drag-and-dropped a file.
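One way such a self-resolving message could be built is sketched below. The function name extension_error and the exact wording are assumptions; only the mv hint for a suffix-less file is taken from the description above.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = (".yaml", ".yml", ".json")


def extension_error(catalog_path: Path) -> ValueError:
    """Build (not raise) an error that tells the operator how to fix the file."""
    name = catalog_path.name
    suffix = catalog_path.suffix.lower()
    if suffix == "":
        # A bare filename gives the operator nothing to go on, so spell out the fix.
        message = (
            f"catalog file {name!r} has no extension, so the loader cannot pick "
            f"a parser; rename it, e.g. `mv {name} {name}.yaml`"
        )
    else:
        message = (
            f"unsupported catalog extension {suffix!r} on {name!r}; "
            f"expected one of {', '.join(SUPPORTED_SUFFIXES)}"
        )
    return ValueError(message)
```

Returning the exception instead of raising it keeps the message logic easy to test; a caller would write raise extension_error(path) in the else branch.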
_load_catalog_data returns a plain dict; the typed loaders turn it into a model. The dispatch in load_catalog / load_any_catalog:
- A dict with a top-level "catalog" key → load_oscal_catalog (an OSCAL Catalog JSON: groups → controls → enhancements). This walks the OSCAL structure, extracting the statement prose, assessment objectives, priority, baseline impact, related controls, and parameters from each control's parts / props / links / params.
- Otherwise the dict's category field decides: "control" → load_evidentia_catalog (the simplified Evidentia format used for frameworks without published OSCAL, like SOC 2 / ISO / CIS / CMMC / PCI DSS); "technique" / "vulnerability" / "obligation" → load_non_control_catalog, which returns TechniqueCatalog, VulnerabilityCatalog, or ObligationCatalog respectively.
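The shape-based dispatch in the two bullets above can be sketched as a small decision function. It returns the loader's name rather than calling it, so the example stays self-contained. The function choose_loader is hypothetical, and treating a missing category as "control" is an assumption that the source does not confirm.

```python
from __future__ import annotations

from typing import Any

NON_CONTROL_CATEGORIES = {"technique", "vulnerability", "obligation"}


def choose_loader(data: dict[str, Any]) -> str:
    """Return the name of the typed loader that should handle this dict."""
    # An OSCAL Catalog wraps everything in a top-level "catalog" key.
    if "catalog" in data:
        return "load_oscal_catalog"
    # Assumption: a missing category is treated as "control".
    category = data.get("category", "control")
    if category == "control":
        return "load_evidentia_catalog"
    if category in NON_CONTROL_CATEGORIES:
        return "load_non_control_catalog"
    raise ValueError(f"unknown catalog category {category!r}")
```

Checking for the "catalog" key first matters: an OSCAL document has no Evidentia-style category field, so the category branch would otherwise misclassify it.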
load_catalog raises if asked for a control catalog when the on-disk data is a non-control category (use load_any_catalog when you don't know the shape). When no explicit path is given, the path is resolved through resolve_catalog_path, which lets a user-imported catalog shadow a bundled stub of the same framework_id — an organization's licensed ISO 27001 copy wins over the Tier-C placeholder, and the shadow is logged.
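The shadowing rule amounts to "user-imported first, bundled second, and log when both exist". The sketch below shows that ordering with two in-memory lookup tables; the function name, its parameters, and the log wording are assumptions, since the real resolve_catalog_path signature is not shown here.

```python
from __future__ import annotations

import logging
from pathlib import Path

logger = logging.getLogger("catalog_shadow_sketch")


def resolve_catalog_path_sketch(
    framework_id: str,
    bundled: dict[str, Path],
    user_imported: dict[str, Path],
) -> Path:
    """Prefer a user-imported catalog over a bundled one with the same id."""
    if framework_id in user_imported:
        if framework_id in bundled:
            # Make the override visible so operators know which copy is in use.
            logger.info(
                "user catalog %s shadows bundled %s",
                user_imported[framework_id],
                bundled[framework_id],
            )
        return user_imported[framework_id]
    if framework_id in bundled:
        return bundled[framework_id]
    raise KeyError(f"no catalog found for framework {framework_id!r}")
```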
ControlCatalog (models/catalog.py) holds the parsed framework: framework_id, framework_name, version, source, controls: list[CatalogControl], families, optional family_hierarchy, the category literal, and tier/license/placeholder metadata.
Each CatalogControl carries id, title, description, family, OSCAL-derived fields (objective, guidance, priority, baseline_impact, related_controls, assessment_objectives, parameters), nested enhancements: list[CatalogControl], and per-control tier/license metadata. AI-governance catalogs additionally use risk_tier and applies_to_annex_iii (v0.9.3).
The clever part is the lookup index, built in model_post_init. NIST renders enhancement IDs two ways — AC-2(1)(a) in publications, ac-2.1.a in OSCAL content — and both are valid input. The catalog normalizes every ID to a dotted, uppercase canonical form (_normalize_control_id, backed by the _PAREN_TO_DOT regex), then walks the full enhancement tree to build a flat {normalized_id: CatalogControl} index. So get_control("AC-2(1)") and get_control("ac-2.1") resolve to the same control, case-insensitively and whitespace-tolerantly. control_count is the size of that index (counting enhancements).
The normalizing regex is bounded ([^()]{1,16}) specifically so it cannot backtrack quadratically on pathological input like a long run of open-parens — the bound comfortably exceeds any real control ID.
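A minimal sketch of the normalization and flat index follows. The regex is reconstructed from the bounded character class quoted above and may not match the project's pattern exactly; the Control dataclass, normalize_control_id, and build_index are simplified stand-ins for CatalogControl, _normalize_control_id, and the work done in model_post_init.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# Bounded class: at most 16 non-paren characters between "(" and ")".
_PAREN_TO_DOT = re.compile(r"\(([^()]{1,16})\)")


def normalize_control_id(raw: str) -> str:
    """Map 'AC-2(1)(a)' and ' ac-2.1.a ' to the same dotted, uppercase form."""
    collapsed = "".join(raw.split())  # drop all whitespace
    return _PAREN_TO_DOT.sub(r".\1", collapsed).upper()


@dataclass
class Control:
    id: str
    title: str = ""
    enhancements: list[Control] = field(default_factory=list)


def build_index(controls: list[Control]) -> dict[str, Control]:
    """Walk the whole enhancement tree into a flat {normalized_id: control} map."""
    index: dict[str, Control] = {}
    stack = list(controls)
    while stack:
        control = stack.pop()
        index[normalize_control_id(control.id)] = control
        stack.extend(control.enhancements)
    return index
```

With this in place, a lookup is a single dict access on the normalized key, for example index[normalize_control_id("AC-2(1)")], regardless of which rendering the caller used.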
FrameworkRegistry (registry.py) is the process-wide singleton that ties it together. It loads the manifest at construction, lazily loads + caches catalogs on first access (get_catalog, get_control), and exposes the crosswalk engine (also lazy) via its crosswalk property. list_frameworks(tier=None, category=None) returns manifest entries in declaration order, optionally filtered. For backward compatibility with v0.1.x callers, a module-level FRAMEWORK_METADATA dict is computed once from the manifest at import time, but new code should prefer load_manifest() or the registry's typed manifest property.
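The lazy-load-and-cache pattern described here can be illustrated with a small stand-in class. LazyRegistry, its constructor arguments, and the dict-based manifest entries are assumptions for the example; the real FrameworkRegistry loads the manifest itself and returns typed models.

```python
from __future__ import annotations

from typing import Any, Callable


class LazyRegistry:
    """Stand-in registry: manifest known up front, catalogs loaded on demand."""

    def __init__(self, entries: list[dict[str, str]], loader: Callable[[str], Any]):
        self._entries = list(entries)   # declaration order is preserved
        self._loader = loader           # hypothetical: maps framework id to catalog
        self._catalogs: dict[str, Any] = {}

    def get_catalog(self, framework_id: str) -> Any:
        if all(e["id"] != framework_id for e in self._entries):
            raise KeyError(f"unknown framework {framework_id!r}")
        if framework_id not in self._catalogs:
            # First access pays the parse cost; later calls reuse the object.
            self._catalogs[framework_id] = self._loader(framework_id)
        return self._catalogs[framework_id]

    def list_frameworks(self, tier: str | None = None,
                        category: str | None = None) -> list[dict[str, str]]:
        return [
            e for e in self._entries
            if (tier is None or e["tier"] == tier)
            and (category is None or e["category"] == category)
        ]
```

Deferring the load means that constructing the registry stays cheap even with 92 frameworks in the manifest, since only the catalogs actually requested are ever parsed.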
The workflow is "drop a file, run the script":
- Drop a catalog JSON (or YAML) into the appropriate tier directory under catalogs/data/ (us-federal/, international/, state-privacy/, threats/, stubs/).
- Run uv run python scripts/catalogs/regenerate_manifest.py. The script scans each tier directory, accepts .json / .yaml / .yml catalog files, infers a best-guess refresh cadence from tier + category, and rewrites frameworks.yaml so the manifest is truthful by construction: the contents of data/<tier>/ are the manifest, with no hand-maintained sync.
One important filter (added v0.10.6 P1): the regenerator skips OSCAL-Catalog sidecar artifacts matching *.oscal.json / *.oscal.yaml. These are downstream-consumption serializations (for example osps-baseline.oscal.json, the OSCAL Catalog 1.2.1 form that ships alongside the OSPS Baseline YAML catalogs) and must not register as their own manifest entries — otherwise the same framework would appear twice. This is why three OSPS YAML catalogs raised the count to 92 while their companion .oscal.json did not.
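The sidecar filter can be sketched as a filename check applied while scanning a tier directory. The helper names is_catalog_file and scan_tier_dir are hypothetical, and the real script does more (reading metadata, inferring refresh cadence, writing frameworks.yaml); this shows only the accept/skip decision.

```python
from __future__ import annotations

from pathlib import Path

CATALOG_SUFFIXES = {".json", ".yaml", ".yml"}
SIDECAR_ENDINGS = (".oscal.json", ".oscal.yaml")


def is_catalog_file(path: Path) -> bool:
    """Accept catalog files but skip OSCAL sidecar serializations."""
    name = path.name.lower()
    if name.endswith(SIDECAR_ENDINGS):
        return False  # downstream artifact, not its own framework
    return path.suffix.lower() in CATALOG_SUFFIXES


def scan_tier_dir(tier_dir: Path) -> list[str]:
    """Return the catalog filenames in one tier directory, sorted by name."""
    return sorted(
        p.name for p in tier_dir.iterdir() if p.is_file() and is_catalog_file(p)
    )
```

The sidecar test has to run before the suffix test, because osps-baseline.oscal.json also ends in .json and would otherwise be accepted.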
- Architecture: where the catalog engine sits in the pipeline.
- Crosswalk engine: the companion engine that maps controls between frameworks.
- Data model: the CatalogControl / ControlCatalog shapes in the model context.
- 4-reference/catalogs.md: the auto-generated inventory of all bundled catalogs.
- 5-compliance/contributing-a-catalog.md: the step-by-step contributor guide.