
feat: BMPM Hebrew alias resolution + seed aliases (P2b)#204

Merged
EtanHey merged 1 commit into main from
feat/p2b-hebrew-alias
Apr 5, 2026

Conversation

@EtanHey
Owner

@EtanHey EtanHey commented Apr 4, 2026

Summary

  • add BMPM-based phonetic utilities and seed known plus phonetic aliases into kg_entity_aliases
  • resolve entities through exact name, exact alias, then stored phonetic aliases in KG and MCP lookup
  • extend the prompt hook to detect Hebrew aliases and phonetic matches, with tests for Hebrew/English resolution and latency
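
The Hebrew detection feeding the prompt hook can be sketched with a simple Unicode-range check (a hedged approximation; the PR's actual looks_hebrew implementation may differ):

```python
def looks_hebrew(text: str) -> bool:
    # The Hebrew Unicode block spans U+0590..U+05FF; a single Hebrew
    # letter is enough to treat the token as a Hebrew alias candidate.
    return any("\u0590" <= ch <= "\u05ff" for ch in text)

print(looks_hebrew("קלוד"))    # → True
print(looks_hebrew("Claude"))  # → False
```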

Verification

  • ruff check pyproject.toml src/brainlayer/phonetic.py scripts/__init__.py scripts/seed_aliases.py src/brainlayer/kg_repo.py src/brainlayer/pipeline/digest.py src/brainlayer/vector_store.py hooks/brainlayer-prompt-search.py tests/test_hebrew_alias.py
  • pytest -q tests/test_hebrew_alias.py
  • pytest -q tests/test_entity_dedup.py
  • pytest -q tests/test_conditional_hooks.py
  • pytest -q tests/test_search_quality.py -k 'entity_detection or entity_routing or kg_hybrid_search_returns_entity_linked_chunks'
  • pytest -q tests/test_phase3_digest.py -k 'brain_entity or entity_lookup'

Notes

  • ruff check src/ hooks/ scripts/ currently reports many pre-existing issues in unrelated scripts, so lint verification was scoped to the touched files.
  • Please request @codex review, @cursor, and @bugbot on this PR per repo workflow.

Note

Add BMPM Hebrew phonetic alias resolution to entity detection and seed alias utilities

  • Adds phonetic.py with Beider-Morse (BMPM) utilities: looks_hebrew, phonetic_key, phonetic_tokens, and phonetic_match for cross-script name matching.
  • Extends entity detection in brainlayer-prompt-search.py to extract Hebrew-script tokens as candidates and resolve them via kg_entity_aliases, falling back to phonetic token overlap scoring.
  • Adds get_entity_by_phonetic_alias to kg_repo.py to query and score alias candidates by Jaccard overlap of BMPM token sets.
  • Adds scripts/seed_aliases.py to populate kg_entity_aliases with predefined and computed phonetic aliases for all entities in a given SQLite DB.
  • Changes entity_lookup in digest.py to prefer resolve_entity (including alias resolution) before falling back to FTS or embedding search.
  • Adds indexes on kg_entity_aliases(alias COLLATE NOCASE) and kg_entity_aliases(alias_type) to support efficient alias lookups.
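
The two new indexes can be reproduced on a toy schema like so (column set assumed from the PR description, not the repo's actual DDL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Minimal stand-in for kg_entity_aliases (real columns assumed, not verified).
conn.execute(
    "CREATE TABLE kg_entity_aliases (alias TEXT, entity_id INTEGER, alias_type TEXT)"
)
conn.execute("CREATE INDEX idx_alias_nocase ON kg_entity_aliases(alias COLLATE NOCASE)")
conn.execute("CREATE INDEX idx_alias_type ON kg_entity_aliases(alias_type)")
conn.execute("INSERT INTO kg_entity_aliases VALUES ('Claude', 1, 'known')")
# The NOCASE collation lets case-insensitive lookups use the index.
row = conn.execute(
    "SELECT entity_id FROM kg_entity_aliases WHERE alias = 'claude' COLLATE NOCASE"
).fetchone()
print(row)  # → (1,)
```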

Macroscope summarized 33512e8.

Summary by CodeRabbit

  • New Features

    • Enhanced entity resolution with support for Hebrew characters and phonetic matching.
    • Added intelligent alias matching using phonetic similarity for improved entity lookups.
    • Implemented multi-stage entity resolution fallback mechanism for more robust matching.
  • Tests

    • Added comprehensive test coverage for Hebrew alias resolution and phonetic matching performance.


@greptile-apps greptile-apps Bot left a comment


Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@EtanHey
Owner Author

EtanHey commented Apr 4, 2026

Requesting review from @codex review, @cursor, and @BugBot per BrainLayer PR workflow.

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@coderabbitai

coderabbitai Bot commented Apr 4, 2026

📝 Walkthrough

This PR introduces phonetic and cross-script entity resolution, particularly for Hebrew entities. It adds phonetic matching utilities, a seeding script to populate alias mappings, and integrates phonetic-aware entity resolution throughout the knowledge graph lookup pipeline.
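
The NFD-based diacritic stripping used by the phonetic module (summarized per file below) works roughly like this; an illustrative sketch, not the PR's exact normalization code:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # NFD splits each character into a base letter plus combining marks;
    # Unicode category "Mn" covers Latin accents and Hebrew niqqud alike.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_diacritics("café"))  # → cafe
```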

Changes

Cohort / File(s) Summary
Phonetic Foundation
src/brainlayer/phonetic.py, pyproject.toml
New phonetic matching module using Abydos Beider-Morse engine; detects Hebrew content, normalizes input via NFD diacritical stripping and transliteration, and provides functions to compute phonetic keys, token sets, and similarity matching. Adds abydos>=0.5.0 runtime dependency.
KG Resolution & Storage
src/brainlayer/kg_repo.py, src/brainlayer/vector_store.py
Enhanced entity resolution with get_entity_by_phonetic_alias() method and token overlap scoring; reordered resolve_entity() lookup steps to check phonetic aliases before FTS5 fuzzy fallback. Added case-insensitive and alias-type indexes on kg_entity_aliases table.
Alias Seeding & Setup
scripts/seed_aliases.py, scripts/__init__.py
New seed script that populates kg_entity_aliases with predefined mappings and derives phonetic alias keys from entity names. Exports seed_aliases() function returning count of inserted rows. Added package docstring to scripts/__init__.py.
Hook & Pipeline Integration
hooks/brainlayer-prompt-search.py, src/brainlayer/pipeline/digest.py
Extended detect_entities_in_prompt() to extract and check Hebrew candidates using compiled regex and phonetic matching against alias table; entity_lookup() now attempts resolution via resolve_entity() before falling back to FTS/semantic search.
Test Coverage
tests/test_hebrew_alias.py
Comprehensive test suite validating phonetic key generation, Hebrew alias resolution paths (direct, seeded, phonetic), hook integration with Hebrew prompts, and performance baseline (<5ms per resolve call).
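
A performance baseline like the <5ms check can be written as a plain pytest function; this is a sketch with a hypothetical resolver, whereas the real test exercises resolve_entity against a seeded DB:

```python
import time

def resolve_entity(name: str):
    # Hypothetical stand-in for the real SQLite-backed KG lookup.
    return {"קלוד": ("entity-1", "Claude", "technology")}.get(name)

def test_resolve_latency_baseline():
    n = 200
    start = time.perf_counter()
    for _ in range(n):
        resolve_entity("קלוד")
    per_call_ms = (time.perf_counter() - start) * 1000 / n
    assert per_call_ms < 5.0  # budget taken from the PR's test suite

test_resolve_latency_baseline()
```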

Sequence Diagram

sequenceDiagram
    participant Hook as Hook<br/>(detect_entities)
    participant Digest as Pipeline<br/>(entity_lookup)
    participant KGRepo as KG Repo<br/>(resolve_entity)
    participant Phonetic as Phonetic<br/>Engine
    participant Store as Vector Store<br/>(kg_entity_aliases)
    
    Hook->>Hook: Extract Hebrew candidates<br/>via regex
    Hook->>KGRepo: resolve_entity(candidate)
    KGRepo->>KGRepo: Check exact name match
    KGRepo->>Store: Query kg_entity_aliases<br/>by name
    KGRepo->>KGRepo: No direct match found
    KGRepo->>KGRepo: Check if looks_hebrew(candidate)
    alt Hebrew detected
        KGRepo->>Store: Query phonetic aliases
        KGRepo->>Phonetic: phonetic_tokens(candidate)
        KGRepo->>Phonetic: phonetic_tokens(stored alias)
        KGRepo->>KGRepo: Compute token overlap<br/>& Jaccard similarity
        KGRepo->>KGRepo: Select best match
    end
    KGRepo-->>Hook: Return entity or nil
    Hook->>Digest: Pass entities
    Digest->>KGRepo: resolve_entity(query)
    KGRepo-->>Digest: Return or fallback<br/>to FTS/semantic

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related PRs

Poem

🐰 A rabbit hops through scripts and phonetic keys,
Hebrew whispers dance on matching breeze,
Aliases bloom from seeds we plant so deep,
Cross-script resolution—promises to keep!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 58.62%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title Check: ✅ Passed. The title clearly summarizes the main changes: adding BMPM-based phonetic Hebrew alias resolution and seeding aliases into the knowledge graph.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Comment on lines +84 to +93
def phonetic_match(name1: str, name2: str, threshold: float = 0.7) -> bool:
    """Check if two names match phonetically."""
    set1 = phonetic_tokens(name1)
    set2 = phonetic_tokens(name2)
    if not set1 or not set2:
        return False
    if set1 & set2:
        return True
    similarity = len(set1 & set2) / len(set1 | set2)
    return similarity >= threshold

🟡 Medium brainlayer/phonetic.py:84

The similarity calculation on line 92 always evaluates to 0 because the early return on lines 90-91 already handles all cases where set1 & set2 is non-empty. If execution reaches line 92, the intersection is empty, so len(set1 & set2) is 0 and similarity is always 0 — the threshold parameter has no effect. The function can only return True for exact token overlap, never for fuzzy matching above a threshold.

-    if set1 & set2:
-        return True
-    similarity = len(set1 & set2) / len(set1 | set2)
-    return similarity >= threshold
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file src/brainlayer/phonetic.py around lines 84-93:

The `similarity` calculation on line 92 always evaluates to 0 because the early return on lines 90-91 already handles all cases where `set1 & set2` is non-empty. If execution reaches line 92, the intersection is empty, so `len(set1 & set2)` is 0 and `similarity` is always 0 — the `threshold` parameter has no effect. The function can only return `True` for exact token overlap, never for fuzzy matching above a threshold.

Evidence trail:
src/brainlayer/phonetic.py lines 84-94 at REVIEWED_COMMIT: The `phonetic_match` function has lines 90-91 `if set1 & set2: return True` which returns early when intersection is non-empty. Line 92 `similarity = len(set1 & set2) / len(set1 | set2)` is only reached when the intersection is empty, making `len(set1 & set2)` always 0, so `similarity` is always 0. The `threshold` parameter is ineffective.
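
A corrected version that actually honors the threshold drops the early return and compares Jaccard similarity directly. This is a sketch: phonetic_tokens is stubbed with a whitespace split here, whereas the real function derives BMPM phonetic codes.

```python
def phonetic_tokens(name: str) -> set[str]:
    # Stand-in tokenizer; the PR's version produces Beider-Morse codes.
    return {t for t in name.lower().split() if t}

def phonetic_match(name1: str, name2: str, threshold: float = 0.7) -> bool:
    """Jaccard similarity over phonetic token sets, gated by threshold."""
    set1, set2 = phonetic_tokens(name1), phonetic_tokens(name2)
    if not set1 or not set2:
        return False
    similarity = len(set1 & set2) / len(set1 | set2)
    return similarity >= threshold

print(phonetic_match("klod klawd", "klod klawd"))            # → True (similarity 1.0)
print(phonetic_match("klod klawd tlod xlod", "klod a b c"))  # → False (1/7 below 0.7)
```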

Comment thread scripts/seed_aliases.py
]


def _iter_entities(store: VectorStore) -> list[tuple[str, str, str]]:

🟢 Low scripts/seed_aliases.py:17

_iter_entities returns list[tuple[str, str, str]] but canonical_name can be NULL in the database (added via ALTER TABLE without NOT NULL constraint), so the actual runtime type is list[tuple[str, str, str | None]]. This causes type checkers to miss that canonical_name may be None, misleading downstream code about what values to expect. Consider updating the return type hint to list[tuple[str, str, str | None]].

Suggested change
- def _iter_entities(store: VectorStore) -> list[tuple[str, str, str]]:
+ def _iter_entities(store: VectorStore) -> list[tuple[str, str, str | None]]:
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file scripts/seed_aliases.py around line 17:

`_iter_entities` returns `list[tuple[str, str, str]]` but `canonical_name` can be `NULL` in the database (added via `ALTER TABLE` without `NOT NULL` constraint), so the actual runtime type is `list[tuple[str, str, str | None]]`. This causes type checkers to miss that `canonical_name` may be `None`, misleading downstream code about what values to expect. Consider updating the return type hint to `list[tuple[str, str, str | None]]`.

Evidence trail:
scripts/seed_aliases.py:17 - return type is `list[tuple[str, str, str]]`
src/brainlayer/vector_store.py:590-597 - `canonical_name` column added via `ALTER TABLE ... ADD COLUMN canonical_name TEXT` (no NOT NULL constraint)
scripts/seed_aliases.py:47 - code uses `canonical_name or ""` pattern, acknowledging it can be None


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hooks/brainlayer-prompt-search.py`:
- Around line 400-427: The current phonetic matching does a full table scan by
calling conn.execute(...) to load phonetic_rows and scoring in Python; change
this to push token filtering and bounding into SQL and/or cache the table:
compute query_tokens via phonetic_tokens(candidate) as you already do, then
build a parameterized SQL WHERE clause that pre-filters aliases using alias LIKE
'%token%' OR alias LIKE '%token2%' for the tokens and add a LIMIT (e.g., LIMIT
100) to the query executed via conn.execute to reduce rows returned (keep the
same selected columns e.id, e.name, e.entity_type, a.alias), and optionally
implement a small in-memory cache keyed by phonetic_key() to reuse the phonetic
alias set (update whatever cache-get/refresh helper you add in this module) so
that phonetic_rows is fetched from cache instead of re-querying on every call;
keep the downstream scoring logic (best_row/best_score) the same but operate
only on the pre-filtered/limited result set.

In `@scripts/seed_aliases.py`:
- Around line 41-54: The current loop uses get_entity_aliases() before/after
each add_entity_alias call (in the block that also uses phonetic_key and
_iter_entities), causing two full reads per insert and racey inserted counts;
instead, change the logic in the loops that call get_entity_aliases(),
add_entity_alias and phonetic_key to either (A) rely on the DB/client insert
result (e.g., rowcount/changes returned by add_entity_alias) to increment
inserted, or (B) preload each entity's alias set once into memory before inserts
and update that in-memory set when add_entity_alias succeeds so you only check
membership in-memory (and increment inserted when you actually add a new alias);
update references to get_entity_aliases, add_entity_alias, phonetic_key,
_iter_entities and the inserted counter accordingly.
- Around line 17-27: The _iter_entities function annotation incorrectly claims
canonical_name is always a str; update its return type to allow None (e.g.,
change list[tuple[str, str, str]] to list[tuple[str, str, Optional[str]]]) and
add/import Optional from typing, keeping the function body unchanged (this
reflects that kg_entities.canonical_name can be NULL and matches downstream
handling of canonical_name).
- Around line 35-40: The dict comprehension building entities_by_name from
_iter_entities silently drops duplicate names (since schema is
UNIQUE(entity_type, name) not name), so detect and handle duplicates: build a
map name -> list(entity_id) from _iter_entities, then if any name maps to
multiple entity_ids either (a) raise an explicit error with the conflicting
name(s) and entity_ids so the seed run fails early, or (b) require KNOWN_ALIASES
entries to specify entity_type and filter the list by that type when resolving
entity_id; update the resolution logic that currently uses
entities_by_name.get(entity_name) to consult the list and disambiguate by
entity_type (or fail) to avoid seeding aliases to the wrong entity.

In `@src/brainlayer/kg_repo.py`:
- Around line 504-564: get_entity_by_phonetic_alias is doing a full table scan
of kg_entity_aliases/kg_entities and scoring in Python which is O(n); change it
to push cheap filtering into SQL (use phonetic_key/phonetic_tokens to take the
first token or token prefixes and add a WHERE clause like a.alias LIKE ? or
token_prefix match and include a LIMIT to bound rows), and extract the
overlap/score computation into a shared helper (e.g.,
phonetic_score(query_tokens, alias_string)) used by get_entity_by_phonetic_alias
and the prompt-search hook so scoring logic is centralized and consistent;
ensure you still treat exact key matches (phonetic_key) as score=1.0 and decode
metadata with json.loads as before.

In `@src/brainlayer/phonetic.py`:
- Around line 84-93: phonetic_match currently returns True on any token overlap
and never uses threshold; change the logic in phonetic_match (which calls
phonetic_tokens) to first handle empty sets, then compute similarity = len(set1
& set2) / len(set1 | set2) and return similarity >= threshold (removing the
unconditional if set1 & set2: return True), or alternatively remove the
threshold parameter if you intend any overlap to match; update phonetic_match
accordingly so the threshold actually controls matching.
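
One way to implement option (a) from the seed_aliases duplicate-name comment above: build a name-to-ids multimap and surface ambiguity before seeding. Row shapes here are hypothetical placeholders.

```python
from collections import defaultdict

def build_name_index(rows):
    """Map name -> [entity_id, ...] so duplicate names surface explicitly."""
    index = defaultdict(list)
    for entity_id, name, _canonical_name in rows:
        index[name].append(entity_id)
    return index

rows = [("e1", "Claude", None), ("e2", "Claude", None), ("e3", "Cursor", None)]
index = build_name_index(rows)
dupes = {name: ids for name, ids in index.items() if len(ids) > 1}
if dupes:
    # Fail early rather than silently seeding aliases onto the wrong entity.
    print(f"ambiguous entity names: {dupes}")
```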

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0a6a74d9-208a-4293-af06-f302ca232441

📥 Commits

Reviewing files that changed from the base of the PR and between 7984e35 and cc7c5cc.

📒 Files selected for processing (9)
  • hooks/brainlayer-prompt-search.py
  • pyproject.toml
  • scripts/__init__.py
  • scripts/seed_aliases.py
  • src/brainlayer/kg_repo.py
  • src/brainlayer/phonetic.py
  • src/brainlayer/pipeline/digest.py
  • src/brainlayer/vector_store.py
  • tests/test_hebrew_alias.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Macroscope - Correctness Check
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

Files:

  • scripts/__init__.py
  • src/brainlayer/vector_store.py
  • src/brainlayer/pipeline/digest.py
  • hooks/brainlayer-prompt-search.py
  • tests/test_hebrew_alias.py
  • src/brainlayer/phonetic.py
  • src/brainlayer/kg_repo.py
  • scripts/seed_aliases.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use Python/Typer CLI architecture for the main package in src/brainlayer/
All scripts and CLI must use paths.py:get_db_path() for resolving database path instead of hardcoding
Implement chunk lifecycle management with columns superseded_by, aggregated_into, archived_at; default search excludes lifecycle-managed chunks
Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
Drop FTS triggers before bulk deletes on chunks table and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches
Implement retry logic on SQLITE_BUSY errors; each worker must use its own database connection
Use ruff check src/ && ruff format src/ for linting and formatting

Files:

  • src/brainlayer/vector_store.py
  • src/brainlayer/pipeline/digest.py
  • src/brainlayer/phonetic.py
  • src/brainlayer/kg_repo.py
src/brainlayer/vector_store.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use sqlite-vec storage via APSW in vector_store.py for vector operations

Files:

  • src/brainlayer/vector_store.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use pytest for testing

Files:

  • tests/test_hebrew_alias.py
🧠 Learnings (7)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Request codex review, cursor review, and bugbot review for BrainLayer PRs
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/vector_store.py : Use sqlite-vec storage via APSW in `vector_store.py` for vector operations

Applied to files:

  • src/brainlayer/vector_store.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/{vector_store,search}*.py : Chunk lifecycle: implement columns `superseded_by`, `aggregated_into`, `archived_at` on chunks table; exclude lifecycle-managed chunks from default search

Applied to files:

  • src/brainlayer/vector_store.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/vector_store.py : Use sqlite-vec with APSW for vector storage and retrieval

Applied to files:

  • src/brainlayer/vector_store.py
📚 Learning: 2026-04-04T15:21:36.497Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 198
File: hooks/brainlayer-prompt-search.py:169-169
Timestamp: 2026-04-04T15:21:36.497Z
Learning: In EtanHey/brainlayer, `hooks/brainlayer-prompt-search.py` reads `entity_type` directly from existing rows in `kg_entities` (read-only). `contracts/entity-types.yaml` defines the write-side schema only and is not authoritative for what `entity_type` values exist in the DB. The DB already stores `technology` (72 entities), `project` (24), and `tool` (1) as valid `entity_type` values, so `INJECT_TYPES` in the hook should match these DB values, not the contract file.

Applied to files:

  • hooks/brainlayer-prompt-search.py
📚 Learning: 2026-04-04T15:21:59.853Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 198
File: hooks/brainlayer-prompt-search.py:241-259
Timestamp: 2026-04-04T15:21:59.853Z
Learning: In `hooks/brainlayer-prompt-search.py` (Python), `record_injection_event()` is explicitly best-effort telemetry: silent `except sqlite3.Error: pass` is intentional — table non-existence or lock failures are acceptable silent failures. `sqlite3.connect(timeout=2)` is the file-open timeout; `PRAGMA busy_timeout` governs per-statement lock-wait. The `DEADLINE_MS` (450ms) guard applies only to the FTS search phase, not to this side-channel write.

Applied to files:

  • hooks/brainlayer-prompt-search.py
📚 Learning: 2026-03-17T01:04:22.497Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 0
File: :0-0
Timestamp: 2026-03-17T01:04:22.497Z
Learning: Applies to src/brainlayer/mcp/**/*.py and brain-bar/Sources/BrainBar/MCPRouter.swift: The 8 required MCP tools are `brain_search`, `brain_store`, `brain_recall`, `brain_entity`, `brain_expand`, `brain_update`, `brain_digest`, `brain_tags`. `brain_tags` is the 8th tool, replacing `brain_get_person`, as defined in the Phase B spec merged in PR `#72`. The Python MCP server already implements `brain_tags`. Legacy `brainlayer_*` aliases must be maintained for backward compatibility.

Applied to files:

  • hooks/brainlayer-prompt-search.py
🔇 Additional comments (11)
scripts/__init__.py (1)

1-1: LGTM!

Standard package initialization to make scripts/ importable.

src/brainlayer/vector_store.py (1)

585-586: LGTM!

Good addition of indexes to support the new alias resolution flow:

  • COLLATE NOCASE on alias enables efficient case-insensitive lookups
  • Index on alias_type speeds up filtering for phonetic alias queries

These align with the query patterns in kg_repo.py and brainlayer-prompt-search.py.

src/brainlayer/pipeline/digest.py (1)

806-819: LGTM!

The updated resolution order is sound:

  1. Try resolve_entity() for exact/alias/phonetic match
  2. Only fall back to FTS/semantic if no match or type mismatch
  3. The short-circuit preserves downstream behavior (relations, evidence, health lookup)

This correctly prioritizes deterministic resolution before fuzzy search.

hooks/brainlayer-prompt-search.py (2)

370-372: LGTM!

Good addition of Hebrew token extraction into the candidates list for entity detection.


379-399: LGTM!

Proper defensive check for kg_entity_aliases table existence and clean fallback to alias lookup when exact name match fails.

tests/test_hebrew_alias.py (1)

1-165: LGTM!

Comprehensive test coverage for the Hebrew alias resolution feature:

  • Unit tests for phonetic utilities
  • Integration tests for alias resolution paths
  • Hook behavior validation
  • Performance regression test (5ms threshold)

Good test hygiene with fresh DB fixtures per test.

src/brainlayer/kg_repo.py (1)

702-776: LGTM on resolution order logic.

The updated resolve_entity order is sensible:

  1. Exact name (case-insensitive) — deterministic, highest priority
  2. Exact alias — handles registered aliases
  3. Canonical name — normalized lookup
  4. Phonetic alias — cross-script fallback
  5. FTS5 fuzzy — last resort

The exact name step running before aliases addresses the concern from the context snippets — if a string is both an alias for Entity A and the canonical name of Entity B, Entity B wins, which is the correct behavior for entity extraction pipelines.
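
The five-step order can be expressed as a first-hit-wins chain (an illustrative sketch; the real resolve_entity runs a SQLite query at each step):

```python
def resolve_entity(name, resolvers):
    """Try each resolver in priority order; the first non-None hit wins."""
    for resolver in resolvers:
        hit = resolver(name)
        if hit is not None:
            return hit
    return None

# Hypothetical lookup tables standing in for the SQL-backed steps:
# exact name, then alias (canonical, phonetic, and FTS would follow).
exact = {"claude": "entity-1"}.get
alias = {"klod": "entity-1"}.get
chain = [lambda n: exact(n.lower()), lambda n: alias(n.lower())]

print(resolve_entity("Claude", chain))  # → entity-1
print(resolve_entity("zzz", chain))     # → None
```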

src/brainlayer/phonetic.py (2)

12-47: LGTM!

Well-structured Hebrew transliteration mappings:

  • Character map covers standard Hebrew alphabet including final forms
  • Sequence map handles common digraphs first (e.g., "היי" → "hei")
  • Order matters: sequence replacement before character replacement

10-10: Parameters are valid and use documented defaults.

name_mode="gen" and match_mode="approx" are both valid and are the default values for BeiderMorse according to the abydos library documentation. No changes needed.

pyproject.toml (1)

39-39: Dependency constraint is valid.

The abydos>=0.5.0 constraint is available on PyPI and references the current latest version (0.5.0). The dependency is appropriately documented with its purpose.

scripts/seed_aliases.py (1)

55-56: Good resource cleanup.

Using finally to close the store is solid and prevents leaked DB handles on exceptions.

Comment on lines +400 to +427
if not rows and alias_table_exists and looks_hebrew(candidate):
    query_tokens = phonetic_tokens(candidate)
    if query_tokens:
        phonetic_rows = conn.execute(
            """
            SELECT e.id, e.name, e.entity_type, a.alias
            FROM kg_entity_aliases a
            JOIN kg_entities e ON a.entity_id = e.id
            WHERE a.alias_type = 'phonetic'
            """
        ).fetchall()
        query_key = phonetic_key(candidate)
        best_row = None
        best_score = 0.0
        for row in phonetic_rows:
            alias_tokens = {token for token in str(row[3]).split() if token}
            if not alias_tokens:
                continue
            overlap = query_tokens & alias_tokens
            if not overlap:
                continue
            score = len(overlap) / len(query_tokens | alias_tokens)
            if row[3] == query_key:
                score = 1.0
            if score > best_score:
                best_score = score
                best_row = row
        rows = [best_row[:3]] if best_row else []

🧹 Nitpick | 🔵 Trivial

Performance concern: full table scan for phonetic alias matching.

For each Hebrew candidate, this fetches all phonetic alias rows from kg_entity_aliases and scores them in Python. With many entities, this becomes O(n) per candidate.

Consider adding a SQL-side filter or limiting the fetch:

  1. Pre-filter by partial token match using LIKE on the alias column
  2. Add a LIMIT clause to bound worst-case
  3. Cache the phonetic alias table if it's small and stable
♻️ Proposed optimization sketch
-                    phonetic_rows = conn.execute(
-                        """
-                        SELECT e.id, e.name, e.entity_type, a.alias
-                        FROM kg_entity_aliases a
-                        JOIN kg_entities e ON a.entity_id = e.id
-                        WHERE a.alias_type = 'phonetic'
-                        """
-                    ).fetchall()
+                    # Use first token for SQL pre-filter to reduce scan size
+                    first_token = next(iter(query_tokens), "")
+                    phonetic_rows = conn.execute(
+                        """
+                        SELECT e.id, e.name, e.entity_type, a.alias
+                        FROM kg_entity_aliases a
+                        JOIN kg_entities e ON a.entity_id = e.id
+                        WHERE a.alias_type = 'phonetic'
+                          AND (a.alias LIKE '%' || ? || '%' OR ? = '')
+                        LIMIT 100
+                        """,
+                        (first_token, first_token),
+                    ).fetchall()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hooks/brainlayer-prompt-search.py` around lines 400 - 427, The current
phonetic matching does a full table scan by calling conn.execute(...) to load
phonetic_rows and scoring in Python; change this to push token filtering and
bounding into SQL and/or cache the table: compute query_tokens via
phonetic_tokens(candidate) as you already do, then build a parameterized SQL
WHERE clause that pre-filters aliases using alias LIKE '%token%' OR alias LIKE
'%token2%' for the tokens and add a LIMIT (e.g., LIMIT 100) to the query
executed via conn.execute to reduce rows returned (keep the same selected
columns e.id, e.name, e.entity_type, a.alias), and optionally implement a small
in-memory cache keyed by phonetic_key() to reuse the phonetic alias set (update
whatever cache-get/refresh helper you add in this module) so that phonetic_rows
is fetched from cache instead of re-querying on every call; keep the downstream
scoring logic (best_row/best_score) the same but operate only on the
pre-filtered/limited result set.

Comment thread scripts/seed_aliases.py
Comment on lines +17 to +27
def _iter_entities(store: VectorStore) -> list[tuple[str, str, str]]:
    """Return entity id, name, and canonical name for all KG entities."""
    return list(
        store._read_cursor().execute(
            """
            SELECT id, name, canonical_name
            FROM kg_entities
            ORDER BY name
            """
        )
    )

⚠️ Potential issue | 🟡 Minor

Fix _iter_entities return typing for nullable canonical_name.

Line 47 already handles canonical_name as nullable, but Line 17 annotates it as always str. Please make the tuple type reflect None possibility.

Proposed fix
-def _iter_entities(store: VectorStore) -> list[tuple[str, str, str]]:
+def _iter_entities(store: VectorStore) -> list[tuple[str, str, str | None]]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/seed_aliases.py` around lines 17 - 27, The _iter_entities function
annotation incorrectly claims canonical_name is always a str; update its return
type to allow None (e.g., change list[tuple[str, str, str]] to list[tuple[str,
str, Optional[str]]]) and add/import Optional from typing, keeping the function
body unchanged (this reflects that kg_entities.canonical_name can be NULL and
matches downstream handling of canonical_name).

Comment thread scripts/seed_aliases.py
Comment on lines +30 to +55
def seed_aliases(db_path: str | Path) -> int:
    """Populate known aliases and BMPM phonetic aliases. Returns rows inserted."""
    store = VectorStore(Path(db_path))
    inserted = 0
    try:
        entities_by_name = {name: entity_id for entity_id, name, _canonical_name in _iter_entities(store)}

        for alias, entity_name, alias_type in KNOWN_ALIASES:
            entity_id = entities_by_name.get(entity_name)
            if entity_id is None:
                continue
            before = len(store.get_entity_aliases(entity_id))
            store.add_entity_alias(alias, entity_id, alias_type=alias_type)
            after = len(store.get_entity_aliases(entity_id))
            inserted += max(0, after - before)

        for entity_id, name, canonical_name in _iter_entities(store):
            for candidate in {name, canonical_name or ""}:
                key = phonetic_key(candidate)
                if not key:
                    continue
                before = len(store.get_entity_aliases(entity_id))
                store.add_entity_alias(key, entity_id, alias_type="phonetic")
                after = len(store.get_entity_aliases(entity_id))
                inserted += max(0, after - before)
    finally:

⚠️ Potential issue | 🔴 Critical

Serialize this entire seeding path; writes are currently unguarded.

Lines 42 and 52 perform write operations in loops, but this script does not enforce a one-writer-at-a-time guard. That risks contention with other MCP write-heavy flows (e.g., digest) and violates the repo’s write-serialization rule.

As per coding guidelines, "Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work."
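One way to satisfy the one-writer rule for the whole seeding pass is to take the write lock up front with BEGIN IMMEDIATE. This is a sketch, not the repo's actual lock helper, and it assumes an autocommit sqlite3 connection (isolation_level=None):

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def exclusive_writer(conn: sqlite3.Connection, timeout_s: float = 30.0):
    """Serialize writes: BEGIN IMMEDIATE acquires the write lock immediately,
    so a concurrent writer (e.g. a digest run) blocks up to busy_timeout or
    fails fast instead of interleaving with this seeding pass."""
    conn.execute(f"PRAGMA busy_timeout = {int(timeout_s * 1000)}")
    conn.execute("BEGIN IMMEDIATE")
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
```

The whole seed_aliases body would then run inside a single `with exclusive_writer(conn):` block, so either every alias lands or none do.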

Comment thread scripts/seed_aliases.py
Comment on lines +35 to +40
        entities_by_name = {name: entity_id for entity_id, name, _canonical_name in _iter_entities(store)}

        for alias, entity_name, alias_type in KNOWN_ALIASES:
            entity_id = entities_by_name.get(entity_name)
            if entity_id is None:
                continue

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Scripts executed (verification commands run against EtanHey/brainlayer; per-script output omitted):

  • rg -nP 'CREATE TABLE\s+kg_entities|UNIQUE\s*\(\s*name\s*\)|name\s+\w+\s+UNIQUE' -C4 — check whether kg_entities.name is uniquely constrained in schema/migrations
  • rg -nP 'INSERT\s+INTO\s+kg_entities|add_entity\s*\(' -C3 --type=py — inspect insert/write paths to confirm whether duplicate names can be created
  • fd -type f -name "*.py" | grep -E "(migration|schema|init.*db)" | head -20
  • find . -type f -name "*.sql" | head -20
  • cat -n scripts/seed_aliases.py | head -50 and cat -n scripts/seed_aliases.py | sed -n '30,60p'
  • rg -n "kg_entities" src/brainlayer/vector_store.py -A 20 | grep -E "CREATE TABLE|UNIQUE|CONSTRAINT" | head -20
  • cat -n src/brainlayer/vector_store.py | sed -n '491,510p'
  • rg -n "entity_type" scripts/seed_aliases.py -B 5 -A 5 and rg -n "INJECT_TYPES|entity_type" hooks/brainlayer-prompt-search.py -B 2 -A 2 | head -40
  • rg -n "duplicate.*name|name.*duplicate" --type py and cat -n scripts/dedup_entities.py | head -50
  • rg -n "same.*name.*different.*type|entity_type.*name" -B 3 -A 3 scripts/seed_aliases.py
  • rg -n "Etan Heyman" --type py -B 2 -A 2 — check whether the KNOWN_ALIASES entities exist and whether duplicate names are plausible
  • rg -n "seed_aliases|seed.*alias" tests/ --type py -B 2 -A 5 | head -50 — look for tests covering duplicate names

Verify name → entity_id mapping for duplicate names across entity types.

Line 35 creates a dict that silently overwrites if kg_entities.name appears more than once. The schema constraint is UNIQUE(entity_type, name) (not on name alone), so duplicate names are possible across different entity types: for example, "Etan Heyman" could exist as both person and project. If such duplicates exist in the database, the dict comprehension keeps only the last entity_id seen, and subsequent alias seeds could target the wrong entity.

Guard against this by either:

  1. Checking that duplicate names do not exist, or
  2. Filtering/validating entity_type before seeding aliases, or
  3. Updating the schema to constrain name uniqueness globally (if appropriate for your model).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/seed_aliases.py` around lines 35 - 40, The dict comprehension
building entities_by_name from _iter_entities silently drops duplicate names
(since schema is UNIQUE(entity_type, name) not name), so detect and handle
duplicates: build a map name -> list(entity_id) from _iter_entities, then if any
name maps to multiple entity_ids either (a) raise an explicit error with the
conflicting name(s) and entity_ids so the seed run fails early, or (b) require
KNOWN_ALIASES entries to specify entity_type and filter the list by that type
when resolving entity_id; update the resolution logic that currently uses
entities_by_name.get(entity_name) to consult the list and disambiguate by
entity_type (or fail) to avoid seeding aliases to the wrong entity.
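A sketch of option (a), failing fast on ambiguous names; the function name and tuple shape are hypothetical stand-ins for the script's helpers:

```python
from collections import defaultdict
from typing import Optional

def build_name_index(entities: list[tuple[str, str, Optional[str]]]) -> dict[str, str]:
    """Map name -> entity_id, raising if any name resolves to more than one
    entity (possible because uniqueness is on (entity_type, name), not name)."""
    by_name: dict[str, list[str]] = defaultdict(list)
    for entity_id, name, _canonical in entities:
        by_name[name].append(entity_id)
    dupes = {n: ids for n, ids in by_name.items() if len(ids) > 1}
    if dupes:
        raise ValueError(f"ambiguous entity names, refusing to seed aliases: {dupes}")
    return {name: ids[0] for name, ids in by_name.items()}
```

Option (b) would instead carry entity_type in each KNOWN_ALIASES entry and filter the candidate list by it before picking an entity_id.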

Comment thread scripts/seed_aliases.py
Comment on lines +41 to +54
            before = len(store.get_entity_aliases(entity_id))
            store.add_entity_alias(alias, entity_id, alias_type=alias_type)
            after = len(store.get_entity_aliases(entity_id))
            inserted += max(0, after - before)

        for entity_id, name, canonical_name in _iter_entities(store):
            for candidate in {name, canonical_name or ""}:
                key = phonetic_key(candidate)
                if not key:
                    continue
                before = len(store.get_entity_aliases(entity_id))
                store.add_entity_alias(key, entity_id, alias_type="phonetic")
                after = len(store.get_entity_aliases(entity_id))
                inserted += max(0, after - before)

⚠️ Potential issue | 🟠 Major

Avoid get_entity_aliases() before/after every insert.

This does two full alias reads per attempted insert, which is expensive and can miscount inserted under concurrent writes. Prefer an insert result signal (e.g., rowcount/changes) or a preloaded alias set updated in-memory during this run.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/seed_aliases.py` around lines 41 - 54, The current loop uses
get_entity_aliases() before/after each add_entity_alias call (in the block that
also uses phonetic_key and _iter_entities), causing two full reads per insert
and racey inserted counts; instead, change the logic in the loops that call
get_entity_aliases(), add_entity_alias and phonetic_key to either (A) rely on
the DB/client insert result (e.g., rowcount/changes returned by
add_entity_alias) to increment inserted, or (B) preload each entity's alias set
once into memory before inserts and update that in-memory set when
add_entity_alias succeeds so you only check membership in-memory (and increment
inserted when you actually add a new alias); update references to
get_entity_aliases, add_entity_alias, phonetic_key, _iter_entities and the
inserted counter accordingly.
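Option (A) can lean on the cursor's rowcount if the insert is idempotent. A sketch, assuming a hypothetical UNIQUE(alias, entity_id) constraint on kg_entity_aliases:

```python
import sqlite3

def add_alias_counted(conn: sqlite3.Connection, alias: str, entity_id: str,
                      alias_type: str) -> int:
    """Insert an alias and report whether a row was actually added, using
    cursor.rowcount instead of read-before/read-after alias scans.
    INSERT OR IGNORE leaves rowcount at 0 when the alias already exists."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO kg_entity_aliases (alias, entity_id, alias_type) "
        "VALUES (?, ?, ?)",
        (alias, entity_id, alias_type),
    )
    return cur.rowcount

# Caller pattern: inserted += add_alias_counted(conn, key, entity_id, "phonetic")
```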

Comment thread src/brainlayer/kg_repo.py
Comment on lines +504 to +564
    def get_entity_by_phonetic_alias(
        self,
        name: str,
        entity_type: Optional[str] = None,
    ) -> Optional[Dict[str, Any]]:
        """Look up an entity by stored BMPM phonetic aliases."""
        query_key = phonetic_key(name)
        query_tokens = phonetic_tokens(name)
        if not query_key or not query_tokens:
            return None

        cursor = self._read_cursor()
        if entity_type:
            rows = list(
                cursor.execute(
                    """
                    SELECT e.id, e.entity_type, e.name, e.metadata, e.created_at, e.updated_at, a.alias
                    FROM kg_entity_aliases a
                    JOIN kg_entities e ON a.entity_id = e.id
                    WHERE a.alias_type = 'phonetic' AND e.entity_type = ?
                    """,
                    (entity_type,),
                )
            )
        else:
            rows = list(
                cursor.execute(
                    """
                    SELECT e.id, e.entity_type, e.name, e.metadata, e.created_at, e.updated_at, a.alias
                    FROM kg_entity_aliases a
                    JOIN kg_entities e ON a.entity_id = e.id
                    WHERE a.alias_type = 'phonetic'
                    """
                )
            )

        best_match = None
        best_score = 0.0
        for row in rows:
            alias_tokens = {token for token in str(row[6]).split() if token}
            if not alias_tokens:
                continue
            overlap = query_tokens & alias_tokens
            if not overlap:
                continue
            score = len(overlap) / len(query_tokens | alias_tokens)
            if row[6] == query_key:
                score = 1.0
            if score <= best_score:
                continue
            best_score = score
            best_match = {
                "id": row[0],
                "entity_type": row[1],
                "name": row[2],
                "metadata": json.loads(row[3]) if row[3] else {},
                "created_at": row[4],
                "updated_at": row[5],
            }

        return best_match

🧹 Nitpick | 🔵 Trivial

Performance concern: full table scan for phonetic alias matching.

get_entity_by_phonetic_alias fetches all rows where alias_type='phonetic' and scores them in Python. This duplicates the same pattern in the prompt-search hook and has the same O(n) concern.

Consider:

  1. Adding SQL-side filtering by a token prefix
  2. Adding a LIMIT clause
  3. Extracting shared scoring logic into a helper to avoid divergence
♻️ Proposed optimization sketch
+        # Pre-filter by first token to reduce scan size
+        first_token = next(iter(query_tokens), "")
         if entity_type:
             rows = list(
                 cursor.execute(
                     """
                     SELECT e.id, e.entity_type, e.name, e.metadata, e.created_at, e.updated_at, a.alias
                     FROM kg_entity_aliases a
                     JOIN kg_entities e ON a.entity_id = e.id
-                    WHERE a.alias_type = 'phonetic' AND e.entity_type = ?
+                    WHERE a.alias_type = 'phonetic' AND e.entity_type = ?
+                      AND (a.alias LIKE '%' || ? || '%' OR ? = '')
+                    LIMIT 200
                     """,
-                    (entity_type,),
+                    (entity_type, first_token, first_token),
                 )
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/kg_repo.py` around lines 504 - 564,
get_entity_by_phonetic_alias is doing a full table scan of
kg_entity_aliases/kg_entities and scoring in Python which is O(n); change it to
push cheap filtering into SQL (use phonetic_key/phonetic_tokens to take the
first token or token prefixes and add a WHERE clause like a.alias LIKE ? or
token_prefix match and include a LIMIT to bound rows), and extract the
overlap/score computation into a shared helper (e.g.,
phonetic_score(query_tokens, alias_string)) used by get_entity_by_phonetic_alias
and the prompt-search hook so scoring logic is centralized and consistent;
ensure you still treat exact key matches (phonetic_key) as score=1.0 and decode
metadata with json.loads as before.
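The shared scoring helper mentioned in point 3 could look like this (the name phonetic_score is a suggestion, not an existing function; it mirrors the Jaccard-plus-exact-key logic in the method above):

```python
def phonetic_score(query_tokens: set[str], query_key: str, alias: str) -> float:
    """Score a stored phonetic alias against a query: Jaccard overlap of
    BMPM tokens, with an exact full-key match short-circuited to 1.0."""
    if alias == query_key:
        return 1.0
    alias_tokens = {t for t in alias.split() if t}
    if not alias_tokens or not query_tokens:
        return 0.0
    overlap = query_tokens & alias_tokens
    if not overlap:
        return 0.0
    return len(overlap) / len(query_tokens | alias_tokens)
```

Both get_entity_by_phonetic_alias and the prompt-search hook could then rank candidates with the same function, keeping the two code paths from drifting apart.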

Comment on lines +84 to +93
def phonetic_match(name1: str, name2: str, threshold: float = 0.7) -> bool:
    """Check if two names match phonetically."""
    set1 = phonetic_tokens(name1)
    set2 = phonetic_tokens(name2)
    if not set1 or not set2:
        return False
    if set1 & set2:
        return True
    similarity = len(set1 & set2) / len(set1 | set2)
    return similarity >= threshold

⚠️ Potential issue | 🟡 Minor

Logic bug: threshold parameter is ineffective.

The function has dead code:

  • Line 90-91: Returns True if any token overlap exists
  • Line 92: If we reach this line, set1 & set2 is empty, so similarity is always 0
  • Line 93: Always returns False since 0 >= threshold is false

The threshold parameter is never actually used for scoring.

🐛 Proposed fix

If the intent is to require a minimum similarity score:

 def phonetic_match(name1: str, name2: str, threshold: float = 0.7) -> bool:
     """Check if two names match phonetically."""
     set1 = phonetic_tokens(name1)
     set2 = phonetic_tokens(name2)
     if not set1 or not set2:
         return False
-    if set1 & set2:
-        return True
     similarity = len(set1 & set2) / len(set1 | set2)
     return similarity >= threshold

Or if any overlap should always match, remove the misleading parameter:

-def phonetic_match(name1: str, name2: str, threshold: float = 0.7) -> bool:
+def phonetic_match(name1: str, name2: str) -> bool:
     """Check if two names match phonetically."""
     set1 = phonetic_tokens(name1)
     set2 = phonetic_tokens(name2)
     if not set1 or not set2:
         return False
-    if set1 & set2:
-        return True
-    similarity = len(set1 & set2) / len(set1 | set2)
-    return similarity >= threshold
+    return bool(set1 & set2)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/phonetic.py` around lines 84 - 93, phonetic_match currently
returns True on any token overlap and never uses threshold; change the logic in
phonetic_match (which calls phonetic_tokens) to first handle empty sets, then
compute similarity = len(set1 & set2) / len(set1 | set2) and return similarity
>= threshold (removing the unconditional if set1 & set2: return True), or
alternatively remove the threshold parameter if you intend any overlap to match;
update phonetic_match accordingly so the threshold actually controls matching.
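To see why the first fix makes threshold meaningful, here is the Jaccard scoring in isolation; the token sets are stubbed stand-ins for real BMPM output:

```python
def jaccard(set1: set[str], set2: set[str]) -> float:
    """Similarity used by the proposed fix: intersection over union."""
    if not set1 or not set2:
        return 0.0
    return len(set1 & set2) / len(set1 | set2)

# With the unconditional `if set1 & set2: return True` removed, a single
# shared token no longer guarantees a match; it must clear the threshold.
partial = jaccard({"itn", "haimn"}, {"haimn", "kohn"})  # 1/3, fails 0.7
exact = jaccard({"itn", "haimn"}, {"itn", "haimn"})     # 1.0, passes
```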

@EtanHey EtanHey force-pushed the feat/p2b-hebrew-alias branch from cc7c5cc to 33512e8 Compare April 4, 2026 23:58
@EtanHey EtanHey merged commit df9dd88 into main Apr 5, 2026
5 of 6 checks passed