feat(validate): ontology policy check (#69 Session 2)#97

Merged
stackbilt-admin merged 2 commits into main from feat/validate-ontology-policy-69
Apr 9, 2026

@stackbilt-admin

Summary

Session 2 of #69 ships the enforcement half of the typed data access policy:

```
charter validate --policy typed-data-access
```

A deterministic commit-time check that loads a data-registry YAML file, scans the current diff for business-concept references, and flags non-canonical alias usage in new code. Built on the Session 1 policy module (#96) and the canonical registry in stackbilt_llc.

What it does

Given this diff:
```ts
export async function handler(tenantId: string) {
  const credits = await checkCredits(tenantId);
  const quota = await getQuota(tenantId);
  const tier = subscription.tier;
}
```

`charter validate` reports:
```
[warn] Ontology policy: WARN
4 non-canonical aliases found in 1 changed file.
Registry: .charter/data-registry.yaml (default) | 21 concepts

Referenced concepts:

  • quota 4× | edge-auth | cross_service_rpc
  • subscription 3× | edge-auth | billing_critical
  • user 2× | edge-auth | pii_scoped

Violations (4):

  • [WARN] src.ts:2
    Uses alias 'credits' for concept 'quota' (edge-auth, cross_service_rpc).
    Prefer the canonical form in new code.
  • [WARN] src.ts:5
    Uses alias 'tier' for concept 'subscription' (edge-auth, billing_critical).

Suggestions:

  • Prefer canonical forms in new code: credits → quota, tier → subscription
```

Architecture

Pure-logic core (@stackbilt/validate)

  • parseOntologyRegistry — minimal YAML subset parser for the registry format. Zero external dependencies.
  • checkOntologyDiff — scans changed lines, returns informational references + alias-usage violations
  • extractIdentifiersFromLine + stripCommentsAndStrings — language-agnostic tokenization with comment/string stripping. Handles JS/C `//`, YAML/shell `#`, SQL `--`, inline `/* */`, JSDoc continuation lines, and all string literals. Guards against false-stripping URLs (`http://`) and C preprocessor directives (`#include`).
  • normalizeToken — unified identifier normalization (lowercase, strip separators) so `tenant_id`, `tenantId`, `TENANT_ID` all match.
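The normalization rule is small enough to sketch. This is an illustration of the described behavior (lowercase, strip separators), not the code in @stackbilt/validate; the name `normalizeTokenSketch` is hypothetical.

```ts
// Hypothetical sketch of the described normalization: lowercase the
// identifier, then drop underscores, hyphens, and whitespace.
function normalizeTokenSketch(raw: string): string {
  return raw.toLowerCase().replace(/[_\-\s]+/g, "");
}

// All three spellings collapse to the same lookup key, "tenantid".
const tenantSpellings = ["tenant_id", "tenantId", "TENANT_ID"].map(normalizeTokenSketch);
```

Normalizing both registry entries and scanned identifiers with the same function is what lets a single index lookup match every casing style.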

CLI surface (@stackbilt/cli)

  • validate.ts — new `--policy` dispatcher. Routes to ontology check when policy matches; throws on unknown policies.
  • validate-ontology.ts — CLI wrapper:
    • Loads registry via --registry flag, config, or default path
    • Parses git diff --unified=0 to extract added lines only
    • Applies per-repo alias ignore list
    • Skips test files and fixture dirs by default (`--scan-tests` to opt back in)
    • Formats output as text or JSON
    • Exit codes: 0 on PASS, 0 on WARN (without --ci), 1 on WARN+--ci or FAIL
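The exit-code rules above reduce to a small mapping. A minimal sketch, assuming a three-valued status and a boolean for `--ci`; the type and function names are hypothetical and not taken from validate-ontology.ts.

```ts
type PolicyStatusSketch = "PASS" | "WARN" | "FAIL";

// Hypothetical helper mirroring the listed rules:
// PASS -> 0, WARN -> 0 unless --ci is set, FAIL -> 1.
function exitCodeForSketch(status: PolicyStatusSketch, ci: boolean): 0 | 1 {
  if (status === "FAIL") return 1;
  if (status === "WARN") return ci ? 1 : 0;
  return 0;
}
```

Keeping WARN non-fatal outside CI lets local runs stay informational while pipelines can still gate on alias usage.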

Config (`.charter/config.json`)

```jsonc
{
  "ontology": {
    "registry": ".charter/data-registry.yaml",
    "ignoreAliases": ["token", "key", "usage", "audit"]
  }
}
```
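The registry lookup order (flag, then config, then default path) could look roughly like this. The sketch assumes the order in which the sources are listed is also their precedence, and it leaves out path resolution entirely; the interface and function names are hypothetical.

```ts
// Shape of the relevant slice of .charter/config.json (fields from the example above).
interface OntologyConfigSketch {
  ontology?: { registry?: string; ignoreAliases?: string[] };
}

const DEFAULT_REGISTRY_SKETCH = ".charter/data-registry.yaml";

// Assumed precedence: --registry flag, then config value, then default path.
function pickRegistrySketch(
  flagValue: string | undefined,
  config: OntologyConfigSketch,
): string {
  return flagValue ?? config.ontology?.registry ?? DEFAULT_REGISTRY_SKETCH;
}
```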

Two iterations after initial ship

Dogfooding the first iteration against charter's own source tree surfaced two classes of false positives:

1. JSDoc continuation lines

Multi-line block comments (`/** ... */`) weren't stripped because each line was processed independently. Lines like ` * tiers and sensitivity levels` in JSDoc prose were tokenized as code. Fixed by treating lines starting with `*` or `/**` as full-line comments (with multiplication guard).
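A minimal sketch of that rule, assuming the check runs per line before tokenization. `stripJsDocLineSketch` is a hypothetical name, and the real `stripCommentsAndStrings` handles more cases than shown here.

```ts
// Hypothetical sketch: blank out JSDoc opener and continuation lines.
function stripJsDocLineSketch(line: string): string {
  const trimmed = line.replace(/^\s+/, "");
  const looksLikeDoc = trimmed.startsWith("/**") || trimmed.startsWith("*");
  // Multiplication guard: in `2 * tier` the `*` is not the first
  // non-whitespace character, so the line is left untouched.
  if (!looksLikeDoc) return line;
  const close = line.indexOf("*/");
  // Keep whatever follows a closing `*/`; otherwise the whole line is comment.
  return close === -1 ? "" : line.slice(close + 2);
}
```

Because the test looks only at the first non-whitespace character, it needs no cross-line state, which keeps it compatible with scanning isolated added lines from a diff.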

2. Generic alias collisions

Aliases like `token`, `key`, `usage`, `tier`, `audit` collide with common programming vocabulary. Charter's own codebase uses these words in their programming sense (lexer tokens, map keys, resource usage), so every file containing one of them was getting flagged.

Fix: added `ontology.ignoreAliases` per-repo config field. Registry stays authoritative ecosystem-wide, but individual repos can opt out of specific noisy alias tokens without touching the shared source of truth.

Charter's own `.charter/config.json` ignores: `token, tokens, key, keys, usage, audit, tier, plan, limit, limits`.
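How the ignore list might be applied, as a sketch: alias hits whose normalized token is in the set are dropped, while canonical hits never count as violations. The `AliasHitSketch` shape and function name are hypothetical; only the `Set<string>` of ignored tokens comes from this PR.

```ts
interface AliasHitSketch {
  token: string;     // normalized identifier found on a changed line
  canonical: string; // concept the token maps to in the registry
}

// Hypothetical filter: returns only the hits that should be reported as violations.
function filterViolationsSketch(
  hits: AliasHitSketch[],
  ignoredAliasTokens: Set<string>,
): AliasHitSketch[] {
  return hits.filter(
    (hit) => hit.token !== hit.canonical && !ignoredAliasTokens.has(hit.token),
  );
}
```

With `tier` in the ignore set, a line using both `credits` and `tier` would still report `credits` as an alias for `quota`.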

Validation

Unit tests (45 new, 399 total)

  • Registry parsing (comments, flow sequences, null tables, malformed input)
  • Identifier extraction (// comments, # comments, SQL --, JSDoc, string literals)
  • URL/preprocessor guards
  • Alias detection, canonical vs alias counts
  • Dedup within a line
  • Per-repo ignore list
  • Multiplication vs JSDoc distinction

End-to-end dogfood

  • Fixture test: a 7-line handler with 4 alias usages + 3 canonical → reports exactly 4 violations, 0 false positives
  • Charter self-scan: 916 added lines across 5 files on this branch → PASS, 0 violations after tuning

Full suite

  • 399/399 charter tests passing (+45)
  • Build clean, typecheck clean
  • `pnpm run verify:adf` all PASS

What's NOT in this PR (deferred to Session 3+)

  • FAIL-severity rules — detecting raw D1 access to other services' tables (needs AST-ish parsing)
  • Unregistered-concept heuristic — flag identifiers that LOOK like business terms but aren't in the registry at all
  • charter doctor integration — standing health checks for registry configuration
  • Registry cleanup PR in stackbilt_llc — narrow the noisy aliases upstream so downstream repos don't need ignore lists (filing as follow-up)

References

Governed-By: #69

🤖 Generated with Claude Code

Kurt Overmier added 2 commits April 9, 2026 15:38
…ed-data-access (#69)

Session 2 of charter#69. Ships the enforcement half of the typed data
access policy: a deterministic commit-time check that loads a data-
registry YAML file and flags non-canonical alias usage in the current
diff.

## What this lands

### Pure-logic core (@stackbilt/validate)

- **`parseOntologyRegistry(yamlText)`** — minimal YAML subset parser
  tailored to the stackbilt_llc/policies/data-registry.yaml shape.
  Handles: 2-level nested maps, inline flow-sequences, # comments,
  blank lines, bare string values, table: null for derived concepts.
  No external dependencies (keeps validate zero-dep beyond @stackbilt/types).
- **`checkOntologyDiff(changedLines, registry)`** — scans each line's
  identifiers against the registry's canonical + alias indexes. Returns
  two outputs: informational references (what concepts were touched)
  and violations (WARN on non-canonical alias usage in new code).
- **`extractIdentifiersFromLine`** + **`stripCommentsAndStrings`** —
  language-agnostic token extraction that strips JS/C `//`, YAML/shell
  `#`, SQL `--`, and all string literals before tokenizing. Prevents
  false positives on alias words appearing in comments or user-facing
  copy (e.g., "usage" in a TODO comment). Guards against stripping URLs
  (`http://`) and C-style preprocessor directives (`#include`).
- **`normalizeToken`** — lowercases, strips underscores/hyphens/spaces
  so `tenant_id`, `tenantId`, `TENANT-ID`, and `tenant id` all normalize
  to the same token.
- Six sensitivity tiers typed as `OntologySensitivityTier`.

### CLI surface (@stackbilt/cli)

- **`charter validate --policy typed-data-access`** — new policy dispatch
  in validate.ts. When `--policy typed-data-access` is provided, routes
  to runOntologyPolicyCheck in validate-ontology.ts instead of the
  default trailer validation. Unknown policies throw CLIError.
- **`validate-ontology.ts`** — CLI wrapper:
  - Loads registry via --registry flag, .charter/config.json ontology.registry,
    or default path (.charter/data-registry.yaml)
  - Extracts added lines from `git diff --unified=0 <range>`
  - Calls the validate package's checker
  - Formats output in text or JSON (per --format)
  - Exit codes: 0 on PASS, 0 on WARN without --ci, 1 on WARN+--ci or FAIL
- **`.charter/config.json` ontology section** — new optional config field:
  `{ "ontology": { "registry": "path/to/registry.yaml" } }`.
  Path is resolved relative to .charter/ directory if not absolute.
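
A sketch of the added-line extraction, assuming standard `git diff --unified=0` output (file headers starting with `+++ `, hunk headers of the form `@@ -a,b +c,d @@`). The names are hypothetical and this is not the parser in validate-ontology.ts.

```ts
interface AddedLineSketch {
  file: string;
  line: number; // 1-based line number in the new file
  text: string;
}

function extractAddedLinesSketch(diffText: string): AddedLineSketch[] {
  const added: AddedLineSketch[] = [];
  let file = "";
  let nextLine = 0;
  let inHunk = false;
  for (const raw of diffText.split("\n")) {
    if (raw.startsWith("diff --git ")) {
      inHunk = false; // a new file section begins
      continue;
    }
    if (!inHunk && raw.startsWith("+++ ")) {
      // "+++ b/src/handler.ts" becomes "src/handler.ts"
      file = raw.slice(4).replace(/^b\//, "");
      continue;
    }
    const hunk = /^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@/.exec(raw);
    if (hunk) {
      nextLine = Number(hunk[1]); // first new-file line of this hunk
      inHunk = true;
      continue;
    }
    if (inHunk && raw.startsWith("+")) {
      added.push({ file, line: nextLine, text: raw.slice(1) });
      nextLine++;
    }
    // With --unified=0 there are no context lines, so removed ("-") lines
    // need no line-number bookkeeping and are simply skipped.
  }
  return added;
}
```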

### Tests (41 new, 395 total)

Covers registry parsing (scalar fields, flow sequences, comments,
null tables, malformed input), identifier extraction (comment stripping
for JS/#/SQL, string literal stripping for ', ", \`), URL guard, alias
detection, canonical vs alias reference counts, dedup within a line,
ignoreAliasViolations suppression, and violation metadata propagation.

## Validation

### Local e2e against the real stackbilt_llc registry (21 concepts)

Added a handler.ts file with 4 alias usages (credits, tier), 3 canonical
usages (quota, user, subscription). The check correctly reported:
- 4 non-canonical aliases (credits × 2, tier × 2)
- Correctly flagged each with file:line, canonical form, owner service,
  sensitivity tier
- 3 clean canonical references surfaced informationally
- Actionable suggestions: "credits → quota, tier → subscription"
- False-positive check: the first iteration extracted "usage" from //
  comments; the stripCommentsAndStrings pass eliminated that noise

### Test suite

- 41 new ontology.test.ts unit tests, all passing
- 395/395 total charter tests passing (no regressions, was 354)
- pnpm run build: clean
- pnpm run verify:adf: PASS on all metrics

## Not yet

Session 3+ will add:
- FAIL-severity detection for direct D1 access to other services' tables
- Unregistered-concept heuristic (flag identifiers that look like business
  terms but aren't in the registry)
- Charter doctor integration for standing registry health checks
- Auto-sync of the registry across consumer repos

## References

- Closes part of #69 (Session 2 of 4)
- Session 1 (policy module): charter#96
- Source registry: Stackbilt-dev/stackbilt_llc/policies/data-registry.yaml
- Related: codebeast#9 (DATA_AUTHORITY), aegis#344 (disambiguation firewall)
…check

Dogfooding Session 2 of charter#69 against charter's own source tree
surfaced two classes of false positives:

## JSDoc interior lines

Multi-line block comments (/** ... */) weren't stripped because the
per-line comment logic only handled //, #, --, and inline /* */. JSDoc
continuation lines look like ` * some prose here` and were tokenized
as if they were code.

Fix: stripCommentsAndStrings now treats lines whose first non-whitespace
token is `*` or `/**` as full-line comments. If `*/` appears later on
the line, content after it is preserved. Distinct from multiplication
because the `*` must be the first non-whitespace character, not an
operator between operands.

4 new test cases cover JSDoc interior, JSDoc opener, multiplication
preservation (`2 * tier`), and closer-on-same-line edge case.

## Generic alias collisions in programming vocabulary

Several registry aliases (token, key, usage, tier, audit) collide with
common programming vocabulary. Every charter file that mentions `token`
(lexer token, API token) or `key` (map key, lookup key) was getting
flagged against the api_key/quota concepts, producing dozens of
noise-warnings per file.

Fix: new `ontology.ignoreAliases` config field in .charter/config.json.
Accepts an array of normalized alias tokens that the check should
silently skip (still reports canonical references). Implemented as
a per-repo override — the registry stays authoritative ecosystem-wide,
but individual repos can opt out of noisy terms for their codebase.

Wired into checkOntologyDiff via `options.ignoredAliasTokens: Set<string>`.
Refactored runOntologyPolicyCheck to load config once and pass it to
resolveRegistryPath (was loading twice).

Charter's own .charter/config.json now ignores: token, tokens, key, keys,
usage, audit, tier, plan, limit, limits — all programming-vocabulary
collisions with the current registry. The ignore list should stay small
and case-specific; prefer narrowing the registry's own aliases upstream
when a term is globally noisy.

## Validation

- Charter main..HEAD dogfood: PASS — 916 added lines across 5 files,
  0 violations, 0 false positives
- 45/45 ontology tests passing (+4 from JSDoc cases)
- 399/399 total charter tests passing (+4 from 395)
- Build clean, typecheck clean

## Follow-up signals

The dogfood surfaced a real finding worth reporting upstream: the
stackbilt_llc data-registry.yaml aliases list is overclaimed. Words
like `token`, `key`, `usage`, `tier`, `audit`, `plan`, `limit` are too
generic to be reliable aliases — they match ordinary programming
vocabulary. A registry cleanup PR in stackbilt_llc would remove these
from the aliases lists, making the per-repo ignore list unnecessary
for most downstream consumers. Filing as follow-up.

Part of #69 (Session 2 tuning).
@stackbilt-admin stackbilt-admin merged commit d662d98 into main Apr 9, 2026
3 checks passed
@stackbilt-admin stackbilt-admin deleted the feat/validate-ontology-policy-69 branch April 9, 2026 20:52