
feat: add lexical defense dictionary exports #262

Merged

EtanHey merged 2 commits into main from feat/lexical-defense-dictionary on Apr 30, 2026

Conversation

@EtanHey (Owner) commented Apr 29, 2026

Summary

  • add a checked-in lexical-defense dictionary for Hebrew names, proper nouns, and recurring ecosystem terms
  • add BrainLayer exports for VoiceLayer snapshot JSON and entity-only whisper GBNF
  • add contract tests for lookup, Hebrew entries, Swift override ordering, and downstream export shapes

Verification

  • pytest -q tests/test_lexical_defense.py
  • ruff check src/brainlayer/lexical_defense.py tests/test_lexical_defense.py scripts/export_lexical_defense_snapshot.py
  • full pre-push gate passed:
    • pytest unit suite => 1787 passed, 2 skipped, 75 deselected, 1 xfailed
    • pytest MCP tool registration => 3 passed
    • pytest isolated eval and hook routing => 32 passed
    • bun test suite => 1 pass
    • test_fts5_determinism.sh => passed

Note

Low Risk: purely additive (new module, data file, CLI, and tests) with no changes to existing runtime paths unless adopted by callers; the main risk is downstream consumers relying on export/lookup semantics.

Overview
Adds a checked-in lexical defense dictionary (lexical_defense_dictionary.json) plus a new brainlayer.lexical_defense module that loads it (LRU-cached), normalizes/looks up surfaces, and generates downstream artifacts (Swift override patterns, VoiceLayer snapshot JSON, and a Whisper protected-entity GBNF).

Introduces a CLI script export_lexical_defense_snapshot.py to export those artifacts in json, voicelayer, or gbnf formats, and adds unit tests covering lookup normalization (incl. Hebrew), export shapes, and ordering guarantees.
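
A plausible skeleton for that script, assuming only details visible in this PR (the --format and --output flags, the cached loader, and the voicelayer_snapshot()/whisper_entity_gbnf() methods); the json branch is omitted here because its exact payload shape is not quoted in this thread:

import argparse
import json
from pathlib import Path

from brainlayer.lexical_defense import load_lexical_defense_dictionary


def main() -> int:
    parser = argparse.ArgumentParser(description="Export lexical defense artifacts.")
    parser.add_argument("--format", choices=("voicelayer", "gbnf"), default="voicelayer")
    parser.add_argument("--output", type=Path, required=True)
    args = parser.parse_args()

    dictionary = load_lexical_defense_dictionary()
    if args.format == "gbnf":
        # Plain grammar text for the whisper protected-entity grammar.
        payload = dictionary.whisper_entity_gbnf()
    else:
        # VoiceLayer snapshot serialized as JSON; ensure_ascii=False keeps Hebrew readable.
        payload = json.dumps(dictionary.voicelayer_snapshot(), ensure_ascii=False, indent=2)

    args.output.parent.mkdir(parents=True, exist_ok=True)  # directory creation, per the walkthrough
    args.output.write_text(payload, encoding="utf-8")
    return 0  # success exit code 0


if __name__ == "__main__":
    raise SystemExit(main())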

Reviewed by Cursor Bugbot for commit 61c2d43. Bugbot is set up for automated code reviews on this repo.

Summary by CodeRabbit

  • New Features
    • Introduced a lexical defense system to protect critical entities and terms from splitting or modification during processing.
    • Added export functionality for the protected entities dictionary, supporting JSON, VoiceLayer, and GBNF grammar formats.
    • Deployed a comprehensive dictionary of protected terms including domain entities, developer terminology, and person/entity names.

Note

Add lexical defense dictionary with multi-format export script

  • Introduces LexicalDefenseDictionary wrapping a JSON data file with O(1) surface lookups, a pre-sorted replacement map, and methods for generating VoiceLayer snapshots, GBNF grammars, and Swift override patterns.
  • Adds export_lexical_defense_snapshot.py, a CLI script that loads the dictionary and writes output in json, voicelayer, or gbnf format to a specified path.
  • The dictionary loader is lru_cache-backed, so repeated calls reuse the in-memory instance.
  • Surfaces are normalized via NFKC + casefolding before indexing to ensure consistent lookup behavior (sketched below).

Macroscope summarized 61c2d43.
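
A rough sketch of the loader and normalization bullets above, assuming the helper names quoted elsewhere in this thread (_normalize_surface, load_lexical_defense_dictionary); the real bodies may differ:

import unicodedata
from functools import lru_cache


def _normalize_surface(surface: str) -> str:
    # NFKC folds compatibility forms (full-width letters, ligatures) to canonical
    # codepoints; casefold() lowercases more aggressively than lower().
    folded = unicodedata.normalize("NFKC", surface).casefold()
    # The walkthrough also mentions alphanumeric filtering, which collapses
    # spacing and punctuation variants onto the same index key.
    return "".join(ch for ch in folded if ch.isalnum())


@lru_cache(maxsize=1)
def load_lexical_defense_dictionary():
    # Cached so repeated calls reuse one in-memory instance; the real loader
    # parses lexical_defense_dictionary.json and builds the indices.
    raise NotImplementedError("illustrative stub")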

@EtanHey (Owner, Author) commented Apr 29, 2026

@codex review
@cursor review
@BugBot review

@coderabbitai (bot) commented Apr 29, 2026

📝 Walkthrough

A new lexical defense system is introduced, comprising a JSON dictionary of protected entities, a Python module providing lookup and export functionality across multiple formats (JSON, voicelayer, GBNF), a CLI script for exporting snapshots, and comprehensive tests validating the system's behavior.

Changes

Lexical Defense Core Module: src/brainlayer/lexical_defense.py
New module introducing the LexicalDefenseEntry dataclass and the LexicalDefenseDictionary class, with surface normalization (NFKC + casefold + alphanumeric filtering), entity lookup via canonical/alias/split-form indexing, and six public methods for generating grammar literals, SLM entity lines, Swift override patterns (sorted by descending priority), voicelayer snapshots, and GBNF grammar definitions. Includes a cached dictionary loader.

Lexical Defense Data: src/brainlayer/lexical_defense_dictionary.json
New JSON data file containing protected entities, with a version, a generation timestamp, and 246 lines of entries covering domain terms (BrainLayer, VoiceLayer, etc.), developer-specific terms (repoGolem, orcClaude, cmux), and person/entity names in multiple scripts including Hebrew, plus per-entry metadata (category, script, split-protection flags, priorities, aliases, split forms, sources).

Export CLI Script: scripts/export_lexical_defense_snapshot.py
New Python CLI entrypoint supporting three output formats via the --format flag: json (structured payload with metadata and per-entry fields), voicelayer (voicelayer_snapshot() JSON serialization), and gbnf (whisper entity grammar text). Includes argparse argument parsing, output directory creation, and a success exit code of 0.

Lexical Defense Tests: tests/test_lexical_defense.py
New test module validating dictionary loading; lookup of canonicals, aliases, and split forms; Hebrew entry presence and behavior; Swift override pattern sorting by priority; voicelayer snapshot structure and content; and GBNF grammar rule presence.
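
The per-entry metadata listed above suggests an entry model along these lines; field names here are inferred from this table and the review's evidence trail, not copied from the file:

from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalDefenseEntry:
    canonical: str
    category: str
    script: str
    protect_from_split: bool  # plain bool, per the review's evidence trail
    priority: int
    aliases: tuple[str, ...] = ()
    split_forms: tuple[str, ...] = ()
    sources: tuple[str, ...] = ()

    @property
    def all_surfaces(self) -> tuple[str, ...]:
        # Everything the surface index should resolve: the canonical form,
        # its aliases, and the split forms seen in transcripts.
        return (self.canonical, *self.aliases, *self.split_forms)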

Sequence Diagram

sequenceDiagram
    participant CLI as Export CLI Script
    participant Loader as Dictionary Loader
    participant Dict as LexicalDefenseDictionary
    participant JSON as JSON File
    participant Out as Output File

    CLI->>Loader: load_lexical_defense_dictionary()
    Loader->>JSON: read JSON (cached)
    JSON-->>Loader: entries + metadata
    Loader->>Dict: instantiate with parsed entries
    Dict->>Dict: normalize surfaces<br/>build indices
    Loader-->>CLI: LexicalDefenseDictionary instance
    
    CLI->>Dict: select format (json/voicelayer/gbnf)
    alt format == json
        Dict-->>CLI: structured dict with metadata
    else format == voicelayer
        Dict-->>CLI: voicelayer_snapshot() dict
    else format == gbnf
        Dict-->>CLI: whisper_entity_gbnf() text
    end
    
    CLI->>Out: write formatted output
    Out-->>CLI: success

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A defense of lexicons we sow,
Where surfaces and aliases flow,
In JSON beds and Python roots,
We guard each entity's sacred shoots! 🌿
From NFKC to GBNF so true,
Protected terms in every view! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: the title 'feat: add lexical defense dictionary exports' accurately and concisely summarizes the main change: introducing a lexical defense dictionary and its associated export functionality through a new CLI script.
  • Linked Issues check ✅ Passed: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: skipped because no linked issues were found for this pull request.
  • Description check ✅ Passed: skipped because CodeRabbit's high-level summary is enabled.


Comment on lines +102 to +113
    def whisper_entity_gbnf(self) -> str:
        protected = [entry for entry in self.entries if entry.protect_from_split]
        lines = ["root ::= protected_entity", ""]
        lines.append(
            "protected_entity ::= "
            + " | ".join(f"entity_{index}" for index, _entry in enumerate(protected))
        )
        lines.append("")
        for index, entry in enumerate(protected):
            literal = entry.canonical.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'entity_{index} ::= "{literal}"')
        return "\n".join(lines)
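
Tracing the method by hand: if only two entries were protected, with canonicals BrainLayer and VoiceLayer (both present in the dictionary; the pairing here is illustrative), the returned grammar text would be:

root ::= protected_entity

protected_entity ::= entity_0 | entity_1

entity_0 ::= "BrainLayer"
entity_1 ::= "VoiceLayer"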

🟢 Low brainlayer/lexical_defense.py:102

whisper_entity_gbnf returns invalid GBNF when no entries have protect_from_split=True. The protected list is empty, so " | ".join(...) produces an empty string and the grammar contains protected_entity ::= with no right-hand side, which is syntactically invalid. Consider handling the empty case by returning an empty string, raising an error, or generating a valid fallback rule.

    def whisper_entity_gbnf(self) -> str:
        protected = [entry for entry in self.entries if entry.protect_from_split]
+        if not protected:
+            return ""
        lines = ["root ::= protected_entity", ""]
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file src/brainlayer/lexical_defense.py around lines 102-113:

`whisper_entity_gbnf` returns invalid GBNF when no entries have `protect_from_split=True`. The `protected` list is empty, so `" | ".join(...)` produces an empty string and the grammar contains `protected_entity ::= ` with no right-hand side, which is syntactically invalid. Consider handling the empty case by returning an empty string, raising an error, or generating a valid fallback rule.

Evidence trail:
src/brainlayer/lexical_defense.py lines 102-113 (REVIEWED_COMMIT): whisper_entity_gbnf filters entries by protect_from_split, joins with " | " which produces empty string when list is empty.
src/brainlayer/lexical_defense.py lines 19-27: LexicalDefenseEntry dataclass with protect_from_split as a plain bool field.
scripts/export_lexical_defense_snapshot.py line 53: caller writes whisper_entity_gbnf() output without checking for empty protected entries.

@coderabbitai (bot) left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/brainlayer/lexical_defense.py`:
- Around line 49-61: In _build_surface_index and _build_replacement_map, detect
duplicate keys before inserting: when iterating self.entries, compute the key
(use _normalize_surface(surface) in _build_surface_index and split_form in
_build_replacement_map) and if the key already exists in the local dict raise a
ValueError (or custom exception) that includes the conflicting key and
references to the existing and new LexicalDefenseEntry.canonical (or entry) to
fail fast instead of silently overwriting; keep the final sorting behavior in
_build_replacement_map after validation.

In `@tests/test_lexical_defense.py`:
- Around line 31-33: The test currently only checks the first and last pattern
priorities; tighten it by asserting the entire patterns list is non-increasing:
iterate the patterns list and for each adjacent pair ensure
patterns[i]["priority"] >= patterns[i+1]["priority"] (use the existing patterns
variable), so any mid-list regression fails; keep the other membership
assertions for "brain layer"/"voice layer" as-is.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ef6bd45b-d6a7-43f0-a104-e8b1fa06cb63

📥 Commits

Reviewing files that changed from the base of the PR and between 4bebb81 and f08b22d.

📒 Files selected for processing (4)
  • scripts/export_lexical_defense_snapshot.py
  • src/brainlayer/lexical_defense.py
  • src/brainlayer/lexical_defense_dictionary.json
  • tests/test_lexical_defense.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: Macroscope - Correctness Check
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.12)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

**/*.py: Use paths.py:get_db_path() for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches

Files:

  • scripts/export_lexical_defense_snapshot.py
  • tests/test_lexical_defense.py
  • src/brainlayer/lexical_defense.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use retry logic on SQLITE_BUSY errors; each worker must use its own database connection to handle concurrency safely
Classification must preserve ai_code, stack_trace, and user_message verbatim; skip noise entries entirely and summarize build_log and dir_listing entries (structure only)
Use AST-aware chunking via tree-sitter; never split stack traces; mask large tool output
For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback via enrichment_controller.py, and Ollama as offline last-resort; allow override via BRAINLAYER_ENRICH_BACKEND env var
Configure enrichment rate via BRAINLAYER_ENRICH_RATE environment variable (default 0.2 = 12 RPM)
Implement chunk lifecycle columns: superseded_by, aggregated_into, archived_at on chunks table; exclude lifecycle-managed chunks from default search; allow include_archived=True to show history
Implement brain_supersede with safety gate for personal data (journals, notes, health/finance); use soft-delete for brain_archive with timestamp
Add supersedes parameter to brain_store for atomic store-and-replace operations
Run linting and formatting with: ruff check src/ && ruff format src/
Run tests with pytest
Use PRAGMA wal_checkpoint(FULL) before and after bulk database operations to prevent WAL bloat

Files:

  • src/brainlayer/lexical_defense.py
🔇 Additional comments (4)
src/brainlayer/lexical_defense_dictionary.json (1)

1-246: Dictionary schema and term coverage look consistent.

The dataset shape is uniform and aligns with the new lexical-defense use cases (aliases/split forms/priority/source provenance).

tests/test_lexical_defense.py (1)

4-24: Coverage for lookup, Hebrew entries, snapshot, and GBNF is strong.

Also applies to: 36-53

scripts/export_lexical_defense_snapshot.py (1)

11-58: CLI export flow is clean and contract-aligned.

Format branching is explicit, output encoding is correct, and filesystem preparation is handled safely.

src/brainlayer/lexical_defense.py (1)

12-47: Normalization, typed entry model, and export helpers are well-structured.

Also applies to: 63-141

Comment on lines +49 to +61
    def _build_surface_index(self) -> dict[str, LexicalDefenseEntry]:
        index: dict[str, LexicalDefenseEntry] = {}
        for entry in self.entries:
            for surface in entry.all_surfaces:
                index[_normalize_surface(surface)] = entry
        return index

    def _build_replacement_map(self) -> dict[str, str]:
        pairs: dict[str, str] = {}
        for entry in self.entries:
            for split_form in entry.split_forms:
                pairs[split_form] = entry.canonical
        return dict(sorted(pairs.items(), key=lambda item: (-len(item[0]), item[0])))

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Fail fast on key collisions instead of silently overwriting.

At Line 53 and Line 60, duplicate normalized surfaces or split forms overwrite previous mappings without signal. Add conflict checks so bad dictionary rows are caught at load time.

Suggested guardrails
 def _build_surface_index(self) -> dict[str, LexicalDefenseEntry]:
     index: dict[str, LexicalDefenseEntry] = {}
     for entry in self.entries:
         for surface in entry.all_surfaces:
-            index[_normalize_surface(surface)] = entry
+            normalized = _normalize_surface(surface)
+            existing = index.get(normalized)
+            if existing is not None and existing.canonical != entry.canonical:
+                raise ValueError(
+                    f"Normalized surface collision: {surface!r} maps to both "
+                    f"{existing.canonical!r} and {entry.canonical!r}"
+                )
+            index[normalized] = entry
     return index

 def _build_replacement_map(self) -> dict[str, str]:
     pairs: dict[str, str] = {}
     for entry in self.entries:
         for split_form in entry.split_forms:
-            pairs[split_form] = entry.canonical
+            existing = pairs.get(split_form)
+            if existing is not None and existing != entry.canonical:
+                raise ValueError(
+                    f"Split-form collision: {split_form!r} maps to both "
+                    f"{existing!r} and {entry.canonical!r}"
+                )
+            pairs[split_form] = entry.canonical
     return dict(sorted(pairs.items(), key=lambda item: (-len(item[0]), item[0])))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/lexical_defense.py` around lines 49 - 61, In
_build_surface_index and _build_replacement_map, detect duplicate keys before
inserting: when iterating self.entries, compute the key (use
_normalize_surface(surface) in _build_surface_index and split_form in
_build_replacement_map) and if the key already exists in the local dict raise a
ValueError (or custom exception) that includes the conflicting key and
references to the existing and new LexicalDefenseEntry.canonical (or entry) to
fail fast instead of silently overwriting; keep the final sorting behavior in
_build_replacement_map after validation.

Comment on lines +31 to +33
    assert patterns[0]["priority"] >= patterns[-1]["priority"]
    assert {"match": "brain layer", "replacement": "BrainLayer", "priority": 100} in patterns
    assert {"match": "voice layer", "replacement": "VoiceLayer", "priority": 100} in patterns
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Strengthen priority-order validation to catch mid-list regressions.

Line 31 only compares endpoints; a misordered middle section would still pass.

Suggested test tightening
-    assert patterns[0]["priority"] >= patterns[-1]["priority"]
+    assert all(
+        patterns[i]["priority"] >= patterns[i + 1]["priority"]
+        for i in range(len(patterns) - 1)
+    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    assert patterns[0]["priority"] >= patterns[-1]["priority"]
+    assert all(
+        patterns[i]["priority"] >= patterns[i + 1]["priority"]
+        for i in range(len(patterns) - 1)
+    )
     assert {"match": "brain layer", "replacement": "BrainLayer", "priority": 100} in patterns
     assert {"match": "voice layer", "replacement": "VoiceLayer", "priority": 100} in patterns
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_lexical_defense.py` around lines 31 - 33, The test currently only
checks the first and last pattern priorities; tighten it by asserting the entire
patterns list is non-increasing: iterate the patterns list and for each adjacent
pair ensure patterns[i]["priority"] >= patterns[i+1]["priority"] (use the
existing patterns variable), so any mid-list regression fails; keep the other
membership assertions for "brain layer"/"voice layer" as-is.

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f08b22d333

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

import json
from pathlib import Path

from brainlayer.lexical_defense import load_lexical_defense_dictionary

P2: Make export script resolve the local brainlayer package

The new script imports brainlayer.lexical_defense directly, but unlike other repo scripts it does not add src/ to sys.path, so running it from a checkout (python scripts/export_lexical_defense_snapshot.py --output ...) fails with ModuleNotFoundError unless the package is already installed with this exact commit. In environments with an older installed brainlayer, it can also import the wrong version and miss lexical_defense entirely, which breaks the export workflow this commit introduces.
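
A common bootstrap for this, assuming the script lives in scripts/ with src/ as a sibling (a sketch of the pattern, not the repo's actual helper):

import sys
from pathlib import Path

# Prepend the checkout's src/ so the local brainlayer package shadows any
# older installed copy when running from a plain git clone.
sys.path.insert(0, str(Path(__file__).resolve().parent.parent / "src"))

from brainlayer.lexical_defense import load_lexical_defense_dictionary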


@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


        return sorted(literals, key=lambda value: (-len(value), value.casefold()))

    def slm_entity_lines(self) -> list[str]:
        return [f"- {entry.canonical} [{entry.category}]" for entry in self.entries if entry.protect_from_split]

Unused methods and attribute never called or tested

Low Severity

grammar_literals and slm_entity_lines are defined but have zero callers anywhere in the codebase (not in the export script, not in tests, not imported by other modules). Similarly, self.by_canonical is computed at construction time but never read. These add untested surface area and maintenance burden without current utility—the PR only delivers VoiceLayer snapshot and GBNF exports.

Additional Locations (1)

EtanHey merged commit 158c963 into main on Apr 30, 2026
7 checks passed
EtanHey deleted the feat/lexical-defense-dictionary branch on April 30, 2026 at 06:08