
feat: add lexical defense dictionary exports #262

Merged

EtanHey merged 2 commits into main from feat/lexical-defense-dictionary on Apr 30, 2026

Conversation

@EtanHey (Owner) commented Apr 29, 2026

Summary

  • add a checked-in lexical-defense dictionary for Hebrew names, proper nouns, and recurring ecosystem terms
  • add BrainLayer exports for VoiceLayer snapshot JSON and entity-only whisper GBNF
  • add contract tests for lookup, Hebrew entries, Swift override ordering, and downstream export shapes

Verification

  • pytest -q tests/test_lexical_defense.py
  • ruff check src/brainlayer/lexical_defense.py tests/test_lexical_defense.py scripts/export_lexical_defense_snapshot.py
  • full pre-push gate passed:
    • pytest unit suite => 1787 passed, 2 skipped, 75 deselected, 1 xfailed
    • pytest MCP tool registration => 3 passed
    • pytest isolated eval and hook routing => 32 passed
    • bun test suite => 1 pass
    • test_fts5_determinism.sh => passed

Note

Low Risk: purely additive (new module, data file, CLI, and tests) with no changes to existing runtime paths unless adopted by callers; the main risk is downstream consumers relying on export/lookup semantics.

Overview
Adds a checked-in lexical defense dictionary (lexical_defense_dictionary.json) plus a new brainlayer.lexical_defense module that loads it (LRU-cached), normalizes/looks up surfaces, and generates downstream artifacts (Swift override patterns, VoiceLayer snapshot JSON, and a Whisper protected-entity GBNF).

Introduces a CLI script export_lexical_defense_snapshot.py to export those artifacts in json, voicelayer, or gbnf formats, and adds unit tests covering lookup normalization (incl. Hebrew), export shapes, and ordering guarantees.
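
A plausible skeleton for that script, assuming only details visible in this PR (the --format and --output flags, the cached loader, and the voicelayer_snapshot()/whisper_entity_gbnf() methods); the json branch is omitted here because its exact payload shape is not quoted in this thread:

import argparse
import json
from pathlib import Path

from brainlayer.lexical_defense import load_lexical_defense_dictionary


def main() -> int:
    parser = argparse.ArgumentParser(description="Export lexical defense artifacts.")
    parser.add_argument("--format", choices=("voicelayer", "gbnf"), default="voicelayer")
    parser.add_argument("--output", type=Path, required=True)
    args = parser.parse_args()

    dictionary = load_lexical_defense_dictionary()
    if args.format == "gbnf":
        # Plain grammar text for the whisper protected-entity grammar.
        payload = dictionary.whisper_entity_gbnf()
    else:
        # VoiceLayer snapshot serialized as JSON; ensure_ascii=False keeps Hebrew readable.
        payload = json.dumps(dictionary.voicelayer_snapshot(), ensure_ascii=False, indent=2)

    args.output.parent.mkdir(parents=True, exist_ok=True)  # directory creation, per the walkthrough
    args.output.write_text(payload, encoding="utf-8")
    return 0  # success exit code 0


if __name__ == "__main__":
    raise SystemExit(main())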

Reviewed by Cursor Bugbot for commit 61c2d43. Bugbot is set up for automated code reviews on this repo.

Summary by CodeRabbit

  • New Features
    • Introduced a lexical defense system to protect critical entities and terms from splitting or modification during processing.
    • Added export functionality for the protected entities dictionary, supporting JSON, VoiceLayer, and GBNF grammar formats.
    • Deployed a comprehensive dictionary of protected terms including domain entities, developer terminology, and person/entity names.

Note

Add lexical defense dictionary with multi-format export script

  • Introduces LexicalDefenseDictionary wrapping a JSON data file with O(1) surface lookups, a pre-sorted replacement map, and methods for generating VoiceLayer snapshots, GBNF grammars, and Swift override patterns.
  • Adds export_lexical_defense_snapshot.py, a CLI script that loads the dictionary and writes output in json, voicelayer, or gbnf format to a specified path.
  • The dictionary loader is lru_cache-backed, so repeated calls reuse the in-memory instance.
  • Surfaces are normalized via NFKC + casefolding before indexing to ensure consistent lookup behavior (sketched below).

Macroscope summarized 61c2d43.
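
A rough sketch of the loader and normalization bullets above, assuming the helper names quoted elsewhere in this thread (_normalize_surface, load_lexical_defense_dictionary); the real bodies may differ:

import unicodedata
from functools import lru_cache


def _normalize_surface(surface: str) -> str:
    # NFKC folds compatibility forms (full-width letters, ligatures) to canonical
    # codepoints; casefold() lowercases more aggressively than lower().
    folded = unicodedata.normalize("NFKC", surface).casefold()
    # The walkthrough also mentions alphanumeric filtering, which collapses
    # spacing and punctuation variants onto the same index key.
    return "".join(ch for ch in folded if ch.isalnum())


@lru_cache(maxsize=1)
def load_lexical_defense_dictionary():
    # Cached so repeated calls reuse one in-memory instance; the real loader
    # parses lexical_defense_dictionary.json and builds the indices.
    raise NotImplementedError("illustrative stub")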

@EtanHey (Owner, Author) commented Apr 29, 2026

@codex review
@cursor review
@BugBot review

@coderabbitai (bot) commented Apr 29, 2026

📝 Walkthrough

A new lexical defense system is introduced, comprising a JSON dictionary of protected entities, a Python module providing lookup and export functionality across multiple formats (JSON, voicelayer, GBNF), a CLI script for exporting snapshots, and comprehensive tests validating the system's behavior.

Changes

Lexical Defense Core Module: src/brainlayer/lexical_defense.py
New module introducing the LexicalDefenseEntry dataclass and the LexicalDefenseDictionary class, with surface normalization (NFKC + casefold + alphanumeric filtering), entity lookup via canonical/alias/split-form indexing, and six public methods for generating grammar literals, SLM entity lines, Swift override patterns (sorted by descending priority), voicelayer snapshots, and GBNF grammar definitions. Includes a cached dictionary loader.

Lexical Defense Data: src/brainlayer/lexical_defense_dictionary.json
New JSON data file containing protected entities, with a version, a generation timestamp, and 246 lines of entries covering domain terms (BrainLayer, VoiceLayer, etc.), developer-specific terms (repoGolem, orcClaude, cmux), and person/entity names in multiple scripts including Hebrew, plus per-entry metadata (category, script, split-protection flags, priorities, aliases, split forms, sources).

Export CLI Script: scripts/export_lexical_defense_snapshot.py
New Python CLI entrypoint supporting three output formats via the --format flag: json (structured payload with metadata and per-entry fields), voicelayer (voicelayer_snapshot() JSON serialization), and gbnf (whisper entity grammar text). Includes argparse argument parsing, output directory creation, and a success exit code of 0.

Lexical Defense Tests: tests/test_lexical_defense.py
New test module validating dictionary loading; lookup of canonicals, aliases, and split forms; Hebrew entry presence and behavior; Swift override pattern sorting by priority; voicelayer snapshot structure and content; and GBNF grammar rule presence.
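
The per-entry metadata listed above suggests an entry model along these lines; field names here are inferred from this table and the review's evidence trail, not copied from the file:

from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalDefenseEntry:
    canonical: str
    category: str
    script: str
    protect_from_split: bool  # plain bool, per the review's evidence trail
    priority: int
    aliases: tuple[str, ...] = ()
    split_forms: tuple[str, ...] = ()
    sources: tuple[str, ...] = ()

    @property
    def all_surfaces(self) -> tuple[str, ...]:
        # Everything the surface index should resolve: the canonical form,
        # its aliases, and the split forms seen in transcripts.
        return (self.canonical, *self.aliases, *self.split_forms)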

Sequence Diagram

sequenceDiagram
    participant CLI as Export CLI Script
    participant Loader as Dictionary Loader
    participant Dict as LexicalDefenseDictionary
    participant JSON as JSON File
    participant Out as Output File

    CLI->>Loader: load_lexical_defense_dictionary()
    Loader->>JSON: read JSON (cached)
    JSON-->>Loader: entries + metadata
    Loader->>Dict: instantiate with parsed entries
    Dict->>Dict: normalize surfaces<br/>build indices
    Loader-->>CLI: LexicalDefenseDictionary instance
    
    CLI->>Dict: select format (json/voicelayer/gbnf)
    alt format == json
        Dict-->>CLI: structured dict with metadata
    else format == voicelayer
        Dict-->>CLI: voicelayer_snapshot() dict
    else format == gbnf
        Dict-->>CLI: whisper_entity_gbnf() text
    end
    
    CLI->>Out: write formatted output
    Out-->>CLI: success

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A defense of lexicons we sow,
Where surfaces and aliases flow,
In JSON beds and Python roots,
We guard each entity's sacred shoots! 🌿
From NFKC to GBNF so true,
Protected terms in every view! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: the title 'feat: add lexical defense dictionary exports' accurately and concisely summarizes the main change: introducing a lexical defense dictionary and its associated export functionality through a new CLI script.
  • Linked Issues check ✅ Passed: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: skipped because no linked issues were found for this pull request.
  • Description check ✅ Passed: skipped because CodeRabbit's high-level summary is enabled.


Comment on lines +102 to +113
    def whisper_entity_gbnf(self) -> str:
        protected = [entry for entry in self.entries if entry.protect_from_split]
        lines = ["root ::= protected_entity", ""]
        lines.append(
            "protected_entity ::= "
            + " | ".join(f"entity_{index}" for index, _entry in enumerate(protected))
        )
        lines.append("")
        for index, entry in enumerate(protected):
            literal = entry.canonical.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'entity_{index} ::= "{literal}"')
        return "\n".join(lines)
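
Tracing the method by hand: if only two entries were protected, with canonicals BrainLayer and VoiceLayer (both present in the dictionary; the pairing here is illustrative), the returned grammar text would be:

root ::= protected_entity

protected_entity ::= entity_0 | entity_1

entity_0 ::= "BrainLayer"
entity_1 ::= "VoiceLayer"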

🟢 Low brainlayer/lexical_defense.py:102

whisper_entity_gbnf returns invalid GBNF when no entries have protect_from_split=True. The protected list is empty, so " | ".join(...) produces an empty string and the grammar contains protected_entity ::= with no right-hand side, which is syntactically invalid. Consider handling the empty case by returning an empty string, raising an error, or generating a valid fallback rule.

    def whisper_entity_gbnf(self) -> str:
        protected = [entry for entry in self.entries if entry.protect_from_split]
+        if not protected:
+            return ""
        lines = ["root ::= protected_entity", ""]
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file src/brainlayer/lexical_defense.py around lines 102-113:

`whisper_entity_gbnf` returns invalid GBNF when no entries have `protect_from_split=True`. The `protected` list is empty, so `" | ".join(...)` produces an empty string and the grammar contains `protected_entity ::= ` with no right-hand side, which is syntactically invalid. Consider handling the empty case by returning an empty string, raising an error, or generating a valid fallback rule.

Evidence trail:
src/brainlayer/lexical_defense.py lines 102-113 (REVIEWED_COMMIT): whisper_entity_gbnf filters entries by protect_from_split, joins with " | " which produces empty string when list is empty.
src/brainlayer/lexical_defense.py lines 19-27: LexicalDefenseEntry dataclass with protect_from_split as a plain bool field.
scripts/export_lexical_defense_snapshot.py line 53: caller writes whisper_entity_gbnf() output without checking for empty protected entries.

@coderabbitai (bot) left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/brainlayer/lexical_defense.py`:
- Around line 49-61: In _build_surface_index and _build_replacement_map, detect
duplicate keys before inserting: when iterating self.entries, compute the key
(use _normalize_surface(surface) in _build_surface_index and split_form in
_build_replacement_map) and if the key already exists in the local dict raise a
ValueError (or custom exception) that includes the conflicting key and
references to the existing and new LexicalDefenseEntry.canonical (or entry) to
fail fast instead of silently overwriting; keep the final sorting behavior in
_build_replacement_map after validation.

In `@tests/test_lexical_defense.py`:
- Around line 31-33: The test currently only checks the first and last pattern
priorities; tighten it by asserting the entire patterns list is non-increasing:
iterate the patterns list and for each adjacent pair ensure
patterns[i]["priority"] >= patterns[i+1]["priority"] (use the existing patterns
variable), so any mid-list regression fails; keep the other membership
assertions for "brain layer"/"voice layer" as-is.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ef6bd45b-d6a7-43f0-a104-e8b1fa06cb63

📥 Commits

Reviewing files that changed from the base of the PR and between 4bebb81 and f08b22d.

📒 Files selected for processing (4)
  • scripts/export_lexical_defense_snapshot.py
  • src/brainlayer/lexical_defense.py
  • src/brainlayer/lexical_defense_dictionary.json
  • tests/test_lexical_defense.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: Macroscope - Correctness Check
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.12)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

**/*.py: Use paths.py:get_db_path() for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches

Files:

  • scripts/export_lexical_defense_snapshot.py
  • tests/test_lexical_defense.py
  • src/brainlayer/lexical_defense.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use retry logic on SQLITE_BUSY errors; each worker must use its own database connection to handle concurrency safely
Classification must preserve ai_code, stack_trace, and user_message verbatim; skip noise entries entirely and summarize build_log and dir_listing entries (structure only)
Use AST-aware chunking via tree-sitter; never split stack traces; mask large tool output
For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback via enrichment_controller.py, and Ollama as offline last-resort; allow override via BRAINLAYER_ENRICH_BACKEND env var
Configure enrichment rate via BRAINLAYER_ENRICH_RATE environment variable (default 0.2 = 12 RPM)
Implement chunk lifecycle columns: superseded_by, aggregated_into, archived_at on chunks table; exclude lifecycle-managed chunks from default search; allow include_archived=True to show history
Implement brain_supersede with safety gate for personal data (journals, notes, health/finance); use soft-delete for brain_archive with timestamp
Add supersedes parameter to brain_store for atomic store-and-replace operations
Run linting and formatting with: ruff check src/ && ruff format src/
Run tests with pytest
Use PRAGMA wal_checkpoint(FULL) before and after bulk database operations to prevent WAL bloat

Files:

  • src/brainlayer/lexical_defense.py
🔇 Additional comments (4)
src/brainlayer/lexical_defense_dictionary.json (1)

1-246: Dictionary schema and term coverage look consistent.

The dataset shape is uniform and aligns with the new lexical-defense use cases (aliases/split forms/priority/source provenance).

tests/test_lexical_defense.py (1)

4-24: Coverage for lookup, Hebrew entries, snapshot, and GBNF is strong.

Also applies to: 36-53

scripts/export_lexical_defense_snapshot.py (1)

11-58: CLI export flow is clean and contract-aligned.

Format branching is explicit, output encoding is correct, and filesystem preparation is handled safely.

src/brainlayer/lexical_defense.py (1)

12-47: Normalization, typed entry model, and export helpers are well-structured.

Also applies to: 63-141

Comment on lines +49 to +61
    def _build_surface_index(self) -> dict[str, LexicalDefenseEntry]:
        index: dict[str, LexicalDefenseEntry] = {}
        for entry in self.entries:
            for surface in entry.all_surfaces:
                index[_normalize_surface(surface)] = entry
        return index

    def _build_replacement_map(self) -> dict[str, str]:
        pairs: dict[str, str] = {}
        for entry in self.entries:
            for split_form in entry.split_forms:
                pairs[split_form] = entry.canonical
        return dict(sorted(pairs.items(), key=lambda item: (-len(item[0]), item[0])))

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Fail fast on key collisions instead of silently overwriting.

At Line 53 and Line 60, duplicate normalized surfaces or split forms overwrite previous mappings without signal. Add conflict checks so bad dictionary rows are caught at load time.

Suggested guardrails
 def _build_surface_index(self) -> dict[str, LexicalDefenseEntry]:
     index: dict[str, LexicalDefenseEntry] = {}
     for entry in self.entries:
         for surface in entry.all_surfaces:
-            index[_normalize_surface(surface)] = entry
+            normalized = _normalize_surface(surface)
+            existing = index.get(normalized)
+            if existing is not None and existing.canonical != entry.canonical:
+                raise ValueError(
+                    f"Normalized surface collision: {surface!r} maps to both "
+                    f"{existing.canonical!r} and {entry.canonical!r}"
+                )
+            index[normalized] = entry
     return index

 def _build_replacement_map(self) -> dict[str, str]:
     pairs: dict[str, str] = {}
     for entry in self.entries:
         for split_form in entry.split_forms:
-            pairs[split_form] = entry.canonical
+            existing = pairs.get(split_form)
+            if existing is not None and existing != entry.canonical:
+                raise ValueError(
+                    f"Split-form collision: {split_form!r} maps to both "
+                    f"{existing!r} and {entry.canonical!r}"
+                )
+            pairs[split_form] = entry.canonical
     return dict(sorted(pairs.items(), key=lambda item: (-len(item[0]), item[0])))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/lexical_defense.py` around lines 49 - 61, In
_build_surface_index and _build_replacement_map, detect duplicate keys before
inserting: when iterating self.entries, compute the key (use
_normalize_surface(surface) in _build_surface_index and split_form in
_build_replacement_map) and if the key already exists in the local dict raise a
ValueError (or custom exception) that includes the conflicting key and
references to the existing and new LexicalDefenseEntry.canonical (or entry) to
fail fast instead of silently overwriting; keep the final sorting behavior in
_build_replacement_map after validation.

Comment on lines +31 to +33
    assert patterns[0]["priority"] >= patterns[-1]["priority"]
    assert {"match": "brain layer", "replacement": "BrainLayer", "priority": 100} in patterns
    assert {"match": "voice layer", "replacement": "VoiceLayer", "priority": 100} in patterns
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Strengthen priority-order validation to catch mid-list regressions.

Line 31 only compares endpoints; a misordered middle section would still pass.

Suggested test tightening
-    assert patterns[0]["priority"] >= patterns[-1]["priority"]
+    assert all(
+        patterns[i]["priority"] >= patterns[i + 1]["priority"]
+        for i in range(len(patterns) - 1)
+    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    assert patterns[0]["priority"] >= patterns[-1]["priority"]
+    assert all(
+        patterns[i]["priority"] >= patterns[i + 1]["priority"]
+        for i in range(len(patterns) - 1)
+    )
     assert {"match": "brain layer", "replacement": "BrainLayer", "priority": 100} in patterns
     assert {"match": "voice layer", "replacement": "VoiceLayer", "priority": 100} in patterns
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_lexical_defense.py` around lines 31 - 33, The test currently only
checks the first and last pattern priorities; tighten it by asserting the entire
patterns list is non-increasing: iterate the patterns list and for each adjacent
pair ensure patterns[i]["priority"] >= patterns[i+1]["priority"] (use the
existing patterns variable), so any mid-list regression fails; keep the other
membership assertions for "brain layer"/"voice layer" as-is.

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f08b22d333

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

import json
from pathlib import Path

from brainlayer.lexical_defense import load_lexical_defense_dictionary

P2: Make export script resolve the local brainlayer package

The new script imports brainlayer.lexical_defense directly, but unlike other repo scripts it does not add src/ to sys.path, so running it from a checkout (python scripts/export_lexical_defense_snapshot.py --output ...) fails with ModuleNotFoundError unless the package is already installed with this exact commit. In environments with an older installed brainlayer, it can also import the wrong version and miss lexical_defense entirely, which breaks the export workflow this commit introduces.
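
A common bootstrap for this, assuming the script lives in scripts/ with src/ as a sibling (a sketch of the pattern, not the repo's actual helper):

import sys
from pathlib import Path

# Prepend the checkout's src/ so the local brainlayer package shadows any
# older installed copy when running from a plain git clone.
sys.path.insert(0, str(Path(__file__).resolve().parent.parent / "src"))

from brainlayer.lexical_defense import load_lexical_defense_dictionary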


@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


        return sorted(literals, key=lambda value: (-len(value), value.casefold()))

    def slm_entity_lines(self) -> list[str]:
        return [f"- {entry.canonical} [{entry.category}]" for entry in self.entries if entry.protect_from_split]

Unused methods and attribute never called or tested

Low Severity

grammar_literals and slm_entity_lines are defined but have zero callers anywhere in the codebase (not in the export script, not in tests, not imported by other modules). Similarly, self.by_canonical is computed at construction time but never read. These add untested surface area and maintenance burden without current utility—the PR only delivers VoiceLayer snapshot and GBNF exports.

Additional Locations (1)

EtanHey merged commit 158c963 into main on Apr 30, 2026
7 checks passed
EtanHey deleted the feat/lexical-defense-dictionary branch on April 30, 2026 at 06:08