feat: add lexical defense dictionary exports #262
📝 Walkthrough

A new lexical defense system is introduced, comprising a JSON dictionary of protected entities, a Python module providing lookup and export functionality across multiple formats (json, voicelayer, gbnf), a CLI script for exporting snapshots, and comprehensive tests validating the system's behavior.
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as Export CLI Script
    participant Loader as Dictionary Loader
    participant Dict as LexicalDefenseDictionary
    participant JSON as JSON File
    participant Out as Output File

    CLI->>Loader: load_lexical_defense_dictionary()
    Loader->>JSON: read JSON (cached)
    JSON-->>Loader: entries + metadata
    Loader->>Dict: instantiate with parsed entries
    Dict->>Dict: normalize surfaces<br/>build indices
    Loader-->>CLI: LexicalDefenseDictionary instance
    CLI->>Dict: select format (json/voicelayer/gbnf)
    alt format == json
        Dict-->>CLI: structured dict with metadata
    else format == voicelayer
        Dict-->>CLI: voicelayer_snapshot() dict
    else format == gbnf
        Dict-->>CLI: whisper_entity_gbnf() text
    end
    CLI->>Out: write formatted output
    Out-->>CLI: success
```
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
```python
def whisper_entity_gbnf(self) -> str:
    protected = [entry for entry in self.entries if entry.protect_from_split]
    lines = ["root ::= protected_entity", ""]
    lines.append(
        "protected_entity ::= "
        + " | ".join(f"entity_{index}" for index, _entry in enumerate(protected))
    )
    lines.append("")
    for index, entry in enumerate(protected):
        literal = entry.canonical.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'entity_{index} ::= "{literal}"')
    return "\n".join(lines)
```
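For reference, with two protected entries whose canonical forms are BrainLayer and VoiceLayer (names taken from this PR's tests; assumed here to be flagged `protect_from_split`), this method emits:

```
root ::= protected_entity

protected_entity ::= entity_0 | entity_1

entity_0 ::= "BrainLayer"
entity_1 ::= "VoiceLayer"
```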
🟢 Low · `brainlayer/lexical_defense.py:102`

`whisper_entity_gbnf` returns invalid GBNF when no entries have `protect_from_split=True`. The `protected` list is empty, so `" | ".join(...)` produces an empty string and the grammar contains `protected_entity ::=` with no right-hand side, which is syntactically invalid. Consider handling the empty case by returning an empty string, raising an error, or generating a valid fallback rule.
```diff
 def whisper_entity_gbnf(self) -> str:
     protected = [entry for entry in self.entries if entry.protect_from_split]
+    if not protected:
+        return ""
     lines = ["root ::= protected_entity", ""]
```

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file src/brainlayer/lexical_defense.py around lines 102-113:
`whisper_entity_gbnf` returns invalid GBNF when no entries have `protect_from_split=True`. The `protected` list is empty, so `" | ".join(...)` produces an empty string and the grammar contains `protected_entity ::= ` with no right-hand side, which is syntactically invalid. Consider handling the empty case by returning an empty string, raising an error, or generating a valid fallback rule.
Evidence trail:

- src/brainlayer/lexical_defense.py lines 102-113 (REVIEWED_COMMIT): whisper_entity_gbnf filters entries by protect_from_split, then joins with " | ", which produces an empty string when the list is empty.
- src/brainlayer/lexical_defense.py lines 19-27: LexicalDefenseEntry dataclass with protect_from_split as a plain bool field.
- scripts/export_lexical_defense_snapshot.py line 53: caller writes whisper_entity_gbnf() output without checking for empty protected entries.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/brainlayer/lexical_defense.py`:
- Around line 49-61: In _build_surface_index and _build_replacement_map, detect
duplicate keys before inserting: when iterating self.entries, compute the key
(use _normalize_surface(surface) in _build_surface_index and split_form in
_build_replacement_map) and if the key already exists in the local dict raise a
ValueError (or custom exception) that includes the conflicting key and
references to the existing and new LexicalDefenseEntry.canonical (or entry) to
fail fast instead of silently overwriting; keep the final sorting behavior in
_build_replacement_map after validation.
In `@tests/test_lexical_defense.py`:
- Around line 31-33: The test currently only checks the first and last pattern
priorities; tighten it by asserting the entire patterns list is non-increasing:
iterate the patterns list and for each adjacent pair ensure
patterns[i]["priority"] >= patterns[i+1]["priority"] (use the existing patterns
variable), so any mid-list regression fails; keep the other membership
assertions for "brain layer"/"voice layer" as-is.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: ef6bd45b-d6a7-43f0-a104-e8b1fa06cb63
📒 Files selected for processing (4)

- scripts/export_lexical_defense_snapshot.py
- src/brainlayer/lexical_defense.py
- src/brainlayer/lexical_defense_dictionary.json
- tests/test_lexical_defense.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Cursor Bugbot
- GitHub Check: Macroscope - Correctness Check
- GitHub Check: test (3.13)
- GitHub Check: test (3.11)
- GitHub Check: test (3.12)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
- Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
- Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
- Run pytest before claiming behavior changed safely; current test suite has 929 tests
- Use `paths.py:get_db_path()` for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
- When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches

Files:

- scripts/export_lexical_defense_snapshot.py
- tests/test_lexical_defense.py
- src/brainlayer/lexical_defense.py
src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
- Use retry logic on `SQLITE_BUSY` errors; each worker must use its own database connection to handle concurrency safely (a minimal retry sketch follows this list)
- Classification must preserve `ai_code`, `stack_trace`, and `user_message` verbatim; skip `noise` entries entirely and summarize `build_log` and `dir_listing` entries (structure only)
- Use AST-aware chunking via tree-sitter; never split stack traces; mask large tool output
- For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback via `enrichment_controller.py`, and Ollama as offline last-resort; allow override via `BRAINLAYER_ENRICH_BACKEND` env var
- Configure enrichment rate via `BRAINLAYER_ENRICH_RATE` environment variable (default 0.2 = 12 RPM)
- Implement chunk lifecycle columns `superseded_by`, `aggregated_into`, `archived_at` on the chunks table; exclude lifecycle-managed chunks from default search; allow `include_archived=True` to show history
- Implement `brain_supersede` with a safety gate for personal data (journals, notes, health/finance); use soft-delete for `brain_archive` with timestamp
- Add a `supersedes` parameter to `brain_store` for atomic store-and-replace operations
- Run linting and formatting with `ruff check src/ && ruff format src/`
- Run tests with `pytest`
- Use `PRAGMA wal_checkpoint(FULL)` before and after bulk database operations to prevent WAL bloat

Files:

- src/brainlayer/lexical_defense.py
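As a minimal sketch of the `SQLITE_BUSY` retry guideline above (illustrative only; the helper name, backoff values, and error matching are assumptions, not code from this repo):

```python
import sqlite3
import time


def execute_with_busy_retry(conn: sqlite3.Connection, sql: str, params=(), retries: int = 5):
    """Retry a write when SQLite reports the database is busy/locked."""
    delay = 0.05
    for attempt in range(retries):
        try:
            return conn.execute(sql, params)
        except sqlite3.OperationalError as exc:
            # SQLITE_BUSY surfaces as "database is locked" via Python's sqlite3 module
            if "locked" not in str(exc) or attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # back off exponentially before retrying
```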
🔇 Additional comments (4)
src/brainlayer/lexical_defense_dictionary.json (1)
1-246: Dictionary schema and term coverage look consistent. The dataset shape is uniform and aligns with the new lexical-defense use cases (aliases/split forms/priority/source provenance).
tests/test_lexical_defense.py (1)
4-24: Coverage for lookup, Hebrew entries, snapshot, and GBNF is strong. Also applies to: 36-53
scripts/export_lexical_defense_snapshot.py (1)
11-58: CLI export flow is clean and contract-aligned. Format branching is explicit, output encoding is correct, and filesystem preparation is handled safely.
src/brainlayer/lexical_defense.py (1)
12-47: Normalization, typed entry model, and export helpers are well-structured. Also applies to: 63-141
```python
def _build_surface_index(self) -> dict[str, LexicalDefenseEntry]:
    index: dict[str, LexicalDefenseEntry] = {}
    for entry in self.entries:
        for surface in entry.all_surfaces:
            index[_normalize_surface(surface)] = entry
    return index


def _build_replacement_map(self) -> dict[str, str]:
    pairs: dict[str, str] = {}
    for entry in self.entries:
        for split_form in entry.split_forms:
            pairs[split_form] = entry.canonical
    return dict(sorted(pairs.items(), key=lambda item: (-len(item[0]), item[0])))
```
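The `(-len(item[0]), item[0])` sort key orders the replacement map longest-first, so a consumer that applies replacements sequentially cannot have a short split form clobber a longer one that contains it. A minimal consumer sketch (the `apply_replacements` helper and the `replacement_map` name are assumptions for illustration):

```python
def apply_replacements(text: str, replacement_map: dict[str, str]) -> str:
    """Apply split-form -> canonical replacements, longest match first.

    Relies on the map already being ordered longest-first, as
    _build_replacement_map guarantees (dicts preserve insertion order).
    """
    for split_form, canonical in replacement_map.items():
        text = text.replace(split_form, canonical)
    return text


# e.g. apply_replacements("ship the brain layer export", replacement_map)
# could yield "ship the BrainLayer export", assuming "brain layer" is a split form.
```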
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Fail fast on key collisions instead of silently overwriting.
At Line 53 and Line 60, duplicate normalized surfaces or split forms overwrite previous mappings without signal. Add conflict checks so bad dictionary rows are caught at load time.
Suggested guardrails:

```diff
 def _build_surface_index(self) -> dict[str, LexicalDefenseEntry]:
     index: dict[str, LexicalDefenseEntry] = {}
     for entry in self.entries:
         for surface in entry.all_surfaces:
-            index[_normalize_surface(surface)] = entry
+            normalized = _normalize_surface(surface)
+            existing = index.get(normalized)
+            if existing is not None and existing.canonical != entry.canonical:
+                raise ValueError(
+                    f"Normalized surface collision: {surface!r} maps to both "
+                    f"{existing.canonical!r} and {entry.canonical!r}"
+                )
+            index[normalized] = entry
     return index

 def _build_replacement_map(self) -> dict[str, str]:
     pairs: dict[str, str] = {}
     for entry in self.entries:
         for split_form in entry.split_forms:
-            pairs[split_form] = entry.canonical
+            existing = pairs.get(split_form)
+            if existing is not None and existing != entry.canonical:
+                raise ValueError(
+                    f"Split-form collision: {split_form!r} maps to both "
+                    f"{existing!r} and {entry.canonical!r}"
+                )
+            pairs[split_form] = entry.canonical
     return dict(sorted(pairs.items(), key=lambda item: (-len(item[0]), item[0])))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/lexical_defense.py` around lines 49 - 61, In
_build_surface_index and _build_replacement_map, detect duplicate keys before
inserting: when iterating self.entries, compute the key (use
_normalize_surface(surface) in _build_surface_index and split_form in
_build_replacement_map) and if the key already exists in the local dict raise a
ValueError (or custom exception) that includes the conflicting key and
references to the existing and new LexicalDefenseEntry.canonical (or entry) to
fail fast instead of silently overwriting; keep the final sorting behavior in
_build_replacement_map after validation.
```python
assert patterns[0]["priority"] >= patterns[-1]["priority"]
assert {"match": "brain layer", "replacement": "BrainLayer", "priority": 100} in patterns
assert {"match": "voice layer", "replacement": "VoiceLayer", "priority": 100} in patterns
```
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Strengthen priority-order validation to catch mid-list regressions.
Line 31 only compares endpoints; a misordered middle section would still pass.
Suggested test tightening:

```diff
-    assert patterns[0]["priority"] >= patterns[-1]["priority"]
+    assert all(
+        patterns[i]["priority"] >= patterns[i + 1]["priority"]
+        for i in range(len(patterns) - 1)
+    )
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
assert all(
    patterns[i]["priority"] >= patterns[i + 1]["priority"]
    for i in range(len(patterns) - 1)
)
assert {"match": "brain layer", "replacement": "BrainLayer", "priority": 100} in patterns
assert {"match": "voice layer", "replacement": "VoiceLayer", "priority": 100} in patterns
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_lexical_defense.py` around lines 31 - 33, The test currently only
checks the first and last pattern priorities; tighten it by asserting the entire
patterns list is non-increasing: iterate the patterns list and for each adjacent
pair ensure patterns[i]["priority"] >= patterns[i+1]["priority"] (use the
existing patterns variable), so any mid-list regression fails; keep the other
membership assertions for "brain layer"/"voice layer" as-is.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f08b22d333
```python
import json
from pathlib import Path

from brainlayer.lexical_defense import load_lexical_defense_dictionary
```
Make export script resolve the local brainlayer package
The new script imports `brainlayer.lexical_defense` directly, but unlike other repo scripts it does not add `src/` to `sys.path`, so running it from a checkout (`python scripts/export_lexical_defense_snapshot.py --output ...`) fails with `ModuleNotFoundError` unless the package is already installed at this exact commit. In environments with an older installed `brainlayer`, it can also import the wrong version and miss `lexical_defense` entirely, which breaks the export workflow this commit introduces.
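A common remedy (a sketch only; the exact bootstrap convention used by the repo's other scripts may differ) is to prepend the checkout's `src/` directory before the import:

```python
import sys
from pathlib import Path

# Make the in-repo package importable when running from a checkout;
# assumes this file lives in scripts/ alongside the src/ directory.
sys.path.insert(0, str(Path(__file__).resolve().parent.parent / "src"))

from brainlayer.lexical_defense import load_lexical_defense_dictionary  # noqa: E402
```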
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
        return sorted(literals, key=lambda value: (-len(value), value.casefold()))

    def slm_entity_lines(self) -> list[str]:
        return [f"- {entry.canonical} [{entry.category}]" for entry in self.entries if entry.protect_from_split]
```
Unused methods and attribute never called or tested
Low Severity
`grammar_literals` and `slm_entity_lines` are defined but have zero callers anywhere in the codebase (not in the export script, not in tests, not imported by other modules). Similarly, `self.by_canonical` is computed at construction time but never read. These add untested surface area and maintenance burden without current utility; the PR only delivers VoiceLayer snapshot and GBNF exports.
Reviewed by Cursor Bugbot for commit 61c2d43.
Summary

Verification

- `pytest -q tests/test_lexical_defense.py`
- `ruff check src/brainlayer/lexical_defense.py tests/test_lexical_defense.py scripts/export_lexical_defense_snapshot.py`
- pytest unit suite => 1787 passed, 2 skipped, 75 deselected, 1 xfailed
- pytest MCP tool registration => 3 passed
- pytest isolated eval and hook routing => 32 passed
- bun test suite => 1 pass
- `test_fts5_determinism.sh` => passed

Note
Low Risk
Purely additive (new module, data file, CLI, and tests) with no changes to existing runtime paths unless adopted by callers; main risk is downstream consumers relying on export/lookup semantics.
Overview
Adds a checked-in lexical defense dictionary (`lexical_defense_dictionary.json`) plus a new `brainlayer.lexical_defense` module that loads it (LRU-cached), normalizes/looks up surfaces, and generates downstream artifacts (Swift override patterns, VoiceLayer snapshot JSON, and a Whisper protected-entity GBNF).

Introduces a CLI script `export_lexical_defense_snapshot.py` to export those artifacts in `json`, `voicelayer`, or `gbnf` formats, and adds unit tests covering lookup normalization (incl. Hebrew), export shapes, and ordering guarantees.
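For orientation, a minimal usage sketch of the new module (function and method names come from this review; exact signatures and the output handling are assumptions):

```python
from brainlayer.lexical_defense import load_lexical_defense_dictionary

# The loader is lru_cache-backed, so repeated calls reuse the same instance.
dictionary = load_lexical_defense_dictionary()

snapshot = dictionary.voicelayer_snapshot()  # dict for VoiceLayer consumers
grammar = dictionary.whisper_entity_gbnf()   # GBNF text for Whisper decoding

with open("lexical_defense.gbnf", "w", encoding="utf-8") as handle:
    handle.write(grammar)
```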
Summary by CodeRabbit
Note
Add lexical defense dictionary with multi-format export script
- `LexicalDefenseDictionary` wrapping a JSON data file with O(1) surface lookups, a pre-sorted replacement map, and methods for generating VoiceLayer snapshots, GBNF grammars, and Swift override patterns.
- `export_lexical_defense_snapshot.py`, a CLI script that loads the dictionary and writes output in `json`, `voicelayer`, or `gbnf` format to a specified path.
- The loader is `lru_cache`-backed so repeated calls reuse the in-memory instance.

Macroscope summarized 61c2d43.