
feat: BMPM Hebrew alias resolution + seed aliases (P2b)#204

Merged
EtanHey merged 1 commit into main from
feat/p2b-hebrew-alias
Apr 5, 2026

Conversation

@EtanHey
Owner

@EtanHey EtanHey commented Apr 4, 2026

Summary

  • add BMPM-based phonetic utilities and seed known plus phonetic aliases into kg_entity_aliases
  • resolve entities through exact name, exact alias, then stored phonetic aliases in KG and MCP lookup
  • extend the prompt hook to detect Hebrew aliases and phonetic matches, with tests for Hebrew/English resolution and latency
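
The Hebrew detection feeding the prompt hook can be sketched with a simple Unicode-range check (a hedged approximation; the PR's actual looks_hebrew implementation may differ):

```python
def looks_hebrew(text: str) -> bool:
    # The Hebrew Unicode block spans U+0590..U+05FF; a single Hebrew
    # letter is enough to treat the token as a Hebrew alias candidate.
    return any("\u0590" <= ch <= "\u05ff" for ch in text)

print(looks_hebrew("קלוד"))    # → True
print(looks_hebrew("Claude"))  # → False
```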

Verification

  • ruff check pyproject.toml src/brainlayer/phonetic.py scripts/__init__.py scripts/seed_aliases.py src/brainlayer/kg_repo.py src/brainlayer/pipeline/digest.py src/brainlayer/vector_store.py hooks/brainlayer-prompt-search.py tests/test_hebrew_alias.py
  • pytest -q tests/test_hebrew_alias.py
  • pytest -q tests/test_entity_dedup.py
  • pytest -q tests/test_conditional_hooks.py
  • pytest -q tests/test_search_quality.py -k 'entity_detection or entity_routing or kg_hybrid_search_returns_entity_linked_chunks'
  • pytest -q tests/test_phase3_digest.py -k 'brain_entity or entity_lookup'

Notes

  • ruff check src/ hooks/ scripts/ currently reports many pre-existing issues in unrelated scripts, so lint verification was scoped to the touched files.
  • Please request @codex review, @cursor, and @bugbot on this PR per repo workflow.

Note

Add BMPM Hebrew phonetic alias resolution to entity detection and seed alias utilities

  • Adds phonetic.py with Beider-Morse (BMPM) utilities: looks_hebrew, phonetic_key, phonetic_tokens, and phonetic_match for cross-script name matching.
  • Extends entity detection in brainlayer-prompt-search.py to extract Hebrew-script tokens as candidates and resolve them via kg_entity_aliases, falling back to phonetic token overlap scoring.
  • Adds get_entity_by_phonetic_alias to kg_repo.py to query and score alias candidates by Jaccard overlap of BMPM token sets.
  • Adds scripts/seed_aliases.py to populate kg_entity_aliases with predefined and computed phonetic aliases for all entities in a given SQLite DB.
  • Changes entity_lookup in digest.py to prefer resolve_entity (including alias resolution) before falling back to FTS or embedding search.
  • Adds indexes on kg_entity_aliases(alias COLLATE NOCASE) and kg_entity_aliases(alias_type) to support efficient alias lookups.
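
The two new indexes can be reproduced on a toy schema like so (column set assumed from the PR description, not the repo's actual DDL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Minimal stand-in for kg_entity_aliases (real columns assumed, not verified).
conn.execute(
    "CREATE TABLE kg_entity_aliases (alias TEXT, entity_id INTEGER, alias_type TEXT)"
)
conn.execute("CREATE INDEX idx_alias_nocase ON kg_entity_aliases(alias COLLATE NOCASE)")
conn.execute("CREATE INDEX idx_alias_type ON kg_entity_aliases(alias_type)")
conn.execute("INSERT INTO kg_entity_aliases VALUES ('Claude', 1, 'known')")
# The NOCASE collation lets case-insensitive lookups use the index.
row = conn.execute(
    "SELECT entity_id FROM kg_entity_aliases WHERE alias = 'claude' COLLATE NOCASE"
).fetchone()
print(row)  # → (1,)
```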

Macroscope summarized 33512e8.

Summary by CodeRabbit

  • New Features

    • Enhanced entity resolution with support for Hebrew characters and phonetic matching.
    • Added intelligent alias matching using phonetic similarity for improved entity lookups.
    • Implemented multi-stage entity resolution fallback mechanism for more robust matching.
  • Tests

    • Added comprehensive test coverage for Hebrew alias resolution and phonetic matching performance.


@greptile-apps greptile-apps Bot left a comment


Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@EtanHey
Owner Author

EtanHey commented Apr 4, 2026

Requesting review from @codex review, @cursor, and @BugBot per BrainLayer PR workflow.

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@coderabbitai

coderabbitai Bot commented Apr 4, 2026

📝 Walkthrough

This PR introduces phonetic and cross-script entity resolution, particularly for Hebrew entities. It adds phonetic matching utilities, a seeding script to populate alias mappings, and integrates phonetic-aware entity resolution throughout the knowledge graph lookup pipeline.
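
The NFD-based diacritic stripping used by the phonetic module (summarized per file below) works roughly like this; an illustrative sketch, not the PR's exact normalization code:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # NFD splits each character into a base letter plus combining marks;
    # Unicode category "Mn" covers Latin accents and Hebrew niqqud alike.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_diacritics("café"))  # → cafe
```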

Changes

Cohort / File(s) Summary
Phonetic Foundation
src/brainlayer/phonetic.py, pyproject.toml
New phonetic matching module using Abydos Beider-Morse engine; detects Hebrew content, normalizes input via NFD diacritical stripping and transliteration, and provides functions to compute phonetic keys, token sets, and similarity matching. Adds abydos>=0.5.0 runtime dependency.
KG Resolution & Storage
src/brainlayer/kg_repo.py, src/brainlayer/vector_store.py
Enhanced entity resolution with get_entity_by_phonetic_alias() method and token overlap scoring; reordered resolve_entity() lookup steps to check phonetic aliases before FTS5 fuzzy fallback. Added case-insensitive and alias-type indexes on kg_entity_aliases table.
Alias Seeding & Setup
scripts/seed_aliases.py, scripts/__init__.py
New seed script that populates kg_entity_aliases with predefined mappings and derives phonetic alias keys from entity names. Exports seed_aliases() function returning count of inserted rows. Added package docstring to scripts/__init__.py.
Hook & Pipeline Integration
hooks/brainlayer-prompt-search.py, src/brainlayer/pipeline/digest.py
Extended detect_entities_in_prompt() to extract and check Hebrew candidates using compiled regex and phonetic matching against alias table; entity_lookup() now attempts resolution via resolve_entity() before falling back to FTS/semantic search.
Test Coverage
tests/test_hebrew_alias.py
Comprehensive test suite validating phonetic key generation, Hebrew alias resolution paths (direct, seeded, phonetic), hook integration with Hebrew prompts, and performance baseline (<5ms per resolve call).
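
A performance baseline like the <5ms check can be written as a plain pytest function; this is a sketch with a hypothetical resolver, whereas the real test exercises resolve_entity against a seeded DB:

```python
import time

def resolve_entity(name: str):
    # Hypothetical stand-in for the real SQLite-backed KG lookup.
    return {"קלוד": ("entity-1", "Claude", "technology")}.get(name)

def test_resolve_latency_baseline():
    n = 200
    start = time.perf_counter()
    for _ in range(n):
        resolve_entity("קלוד")
    per_call_ms = (time.perf_counter() - start) * 1000 / n
    assert per_call_ms < 5.0  # budget taken from the PR's test suite

test_resolve_latency_baseline()
```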

Sequence Diagram

sequenceDiagram
    participant Hook as Hook<br/>(detect_entities)
    participant Digest as Pipeline<br/>(entity_lookup)
    participant KGRepo as KG Repo<br/>(resolve_entity)
    participant Phonetic as Phonetic<br/>Engine
    participant Store as Vector Store<br/>(kg_entity_aliases)
    
    Hook->>Hook: Extract Hebrew candidates<br/>via regex
    Hook->>KGRepo: resolve_entity(candidate)
    KGRepo->>KGRepo: Check exact name match
    KGRepo->>Store: Query kg_entity_aliases<br/>by name
    KGRepo->>KGRepo: No direct match found
    KGRepo->>KGRepo: Check if looks_hebrew(candidate)
    alt Hebrew detected
        KGRepo->>Store: Query phonetic aliases
        KGRepo->>Phonetic: phonetic_tokens(candidate)
        KGRepo->>Phonetic: phonetic_tokens(stored alias)
        KGRepo->>KGRepo: Compute token overlap<br/>& Jaccard similarity
        KGRepo->>KGRepo: Select best match
    end
    KGRepo-->>Hook: Return entity or nil
    Hook->>Digest: Pass entities
    Digest->>KGRepo: resolve_entity(query)
    KGRepo-->>Digest: Return or fallback<br/>to FTS/semantic

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related PRs

Poem

🐰 A rabbit hops through scripts and phonetic keys,
Hebrew whispers dance on matching breeze,
Aliases bloom from seeds we plant so deep,
Cross-script resolution—promises to keep!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 58.62%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title Check: ✅ Passed. The title clearly summarizes the main changes: adding BMPM-based phonetic Hebrew alias resolution and seeding aliases into the knowledge graph.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Comment on lines +84 to +93
def phonetic_match(name1: str, name2: str, threshold: float = 0.7) -> bool:
    """Check if two names match phonetically."""
    set1 = phonetic_tokens(name1)
    set2 = phonetic_tokens(name2)
    if not set1 or not set2:
        return False
    if set1 & set2:
        return True
    similarity = len(set1 & set2) / len(set1 | set2)
    return similarity >= threshold

🟡 Medium brainlayer/phonetic.py:84

The similarity calculation on line 92 always evaluates to 0 because the early return on lines 90-91 already handles all cases where set1 & set2 is non-empty. If execution reaches line 92, the intersection is empty, so len(set1 & set2) is 0 and similarity is always 0 — the threshold parameter has no effect. The function can only return True for exact token overlap, never for fuzzy matching above a threshold.

-    if set1 & set2:
-        return True
-    similarity = len(set1 & set2) / len(set1 | set2)
-    return similarity >= threshold
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file src/brainlayer/phonetic.py around lines 84-93:

The `similarity` calculation on line 92 always evaluates to 0 because the early return on lines 90-91 already handles all cases where `set1 & set2` is non-empty. If execution reaches line 92, the intersection is empty, so `len(set1 & set2)` is 0 and `similarity` is always 0 — the `threshold` parameter has no effect. The function can only return `True` for exact token overlap, never for fuzzy matching above a threshold.

Evidence trail:
src/brainlayer/phonetic.py lines 84-94 at REVIEWED_COMMIT: The `phonetic_match` function has lines 90-91 `if set1 & set2: return True` which returns early when intersection is non-empty. Line 92 `similarity = len(set1 & set2) / len(set1 | set2)` is only reached when the intersection is empty, making `len(set1 & set2)` always 0, so `similarity` is always 0. The `threshold` parameter is ineffective.
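
A corrected version that actually honors the threshold drops the early return and compares Jaccard similarity directly. This is a sketch: phonetic_tokens is stubbed with a whitespace split here, whereas the real function derives BMPM phonetic codes.

```python
def phonetic_tokens(name: str) -> set[str]:
    # Stand-in tokenizer; the PR's version produces Beider-Morse codes.
    return {t for t in name.lower().split() if t}

def phonetic_match(name1: str, name2: str, threshold: float = 0.7) -> bool:
    """Jaccard similarity over phonetic token sets, gated by threshold."""
    set1, set2 = phonetic_tokens(name1), phonetic_tokens(name2)
    if not set1 or not set2:
        return False
    similarity = len(set1 & set2) / len(set1 | set2)
    return similarity >= threshold

print(phonetic_match("klod klawd", "klod klawd"))            # → True (similarity 1.0)
print(phonetic_match("klod klawd tlod xlod", "klod a b c"))  # → False (1/7 below 0.7)
```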

Comment thread scripts/seed_aliases.py
]


def _iter_entities(store: VectorStore) -> list[tuple[str, str, str]]:

🟢 Low scripts/seed_aliases.py:17

_iter_entities returns list[tuple[str, str, str]] but canonical_name can be NULL in the database (added via ALTER TABLE without NOT NULL constraint), so the actual runtime type is list[tuple[str, str, str | None]]. This causes type checkers to miss that canonical_name may be None, misleading downstream code about what values to expect. Consider updating the return type hint to list[tuple[str, str, str | None]].

Suggested change
- def _iter_entities(store: VectorStore) -> list[tuple[str, str, str]]:
+ def _iter_entities(store: VectorStore) -> list[tuple[str, str, str | None]]:
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file scripts/seed_aliases.py around line 17:

`_iter_entities` returns `list[tuple[str, str, str]]` but `canonical_name` can be `NULL` in the database (added via `ALTER TABLE` without `NOT NULL` constraint), so the actual runtime type is `list[tuple[str, str, str | None]]`. This causes type checkers to miss that `canonical_name` may be `None`, misleading downstream code about what values to expect. Consider updating the return type hint to `list[tuple[str, str, str | None]]`.

Evidence trail:
scripts/seed_aliases.py:17 - return type is `list[tuple[str, str, str]]`
src/brainlayer/vector_store.py:590-597 - `canonical_name` column added via `ALTER TABLE ... ADD COLUMN canonical_name TEXT` (no NOT NULL constraint)
scripts/seed_aliases.py:47 - code uses `canonical_name or ""` pattern, acknowledging it can be None


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hooks/brainlayer-prompt-search.py`:
- Around line 400-427: The current phonetic matching does a full table scan by
calling conn.execute(...) to load phonetic_rows and scoring in Python; change
this to push token filtering and bounding into SQL and/or cache the table:
compute query_tokens via phonetic_tokens(candidate) as you already do, then
build a parameterized SQL WHERE clause that pre-filters aliases using alias LIKE
'%token%' OR alias LIKE '%token2%' for the tokens and add a LIMIT (e.g., LIMIT
100) to the query executed via conn.execute to reduce rows returned (keep the
same selected columns e.id, e.name, e.entity_type, a.alias), and optionally
implement a small in-memory cache keyed by phonetic_key() to reuse the phonetic
alias set (update whatever cache-get/refresh helper you add in this module) so
that phonetic_rows is fetched from cache instead of re-querying on every call;
keep the downstream scoring logic (best_row/best_score) the same but operate
only on the pre-filtered/limited result set.

In `@scripts/seed_aliases.py`:
- Around line 41-54: The current loop uses get_entity_aliases() before/after
each add_entity_alias call (in the block that also uses phonetic_key and
_iter_entities), causing two full reads per insert and racey inserted counts;
instead, change the logic in the loops that call get_entity_aliases(),
add_entity_alias and phonetic_key to either (A) rely on the DB/client insert
result (e.g., rowcount/changes returned by add_entity_alias) to increment
inserted, or (B) preload each entity's alias set once into memory before inserts
and update that in-memory set when add_entity_alias succeeds so you only check
membership in-memory (and increment inserted when you actually add a new alias);
update references to get_entity_aliases, add_entity_alias, phonetic_key,
_iter_entities and the inserted counter accordingly.
- Around line 17-27: The _iter_entities function annotation incorrectly claims
canonical_name is always a str; update its return type to allow None (e.g.,
change list[tuple[str, str, str]] to list[tuple[str, str, Optional[str]]]) and
add/import Optional from typing, keeping the function body unchanged (this
reflects that kg_entities.canonical_name can be NULL and matches downstream
handling of canonical_name).
- Around line 35-40: The dict comprehension building entities_by_name from
_iter_entities silently drops duplicate names (since schema is
UNIQUE(entity_type, name) not name), so detect and handle duplicates: build a
map name -> list(entity_id) from _iter_entities, then if any name maps to
multiple entity_ids either (a) raise an explicit error with the conflicting
name(s) and entity_ids so the seed run fails early, or (b) require KNOWN_ALIASES
entries to specify entity_type and filter the list by that type when resolving
entity_id; update the resolution logic that currently uses
entities_by_name.get(entity_name) to consult the list and disambiguate by
entity_type (or fail) to avoid seeding aliases to the wrong entity.

In `@src/brainlayer/kg_repo.py`:
- Around line 504-564: get_entity_by_phonetic_alias is doing a full table scan
of kg_entity_aliases/kg_entities and scoring in Python which is O(n); change it
to push cheap filtering into SQL (use phonetic_key/phonetic_tokens to take the
first token or token prefixes and add a WHERE clause like a.alias LIKE ? or
token_prefix match and include a LIMIT to bound rows), and extract the
overlap/score computation into a shared helper (e.g.,
phonetic_score(query_tokens, alias_string)) used by get_entity_by_phonetic_alias
and the prompt-search hook so scoring logic is centralized and consistent;
ensure you still treat exact key matches (phonetic_key) as score=1.0 and decode
metadata with json.loads as before.

In `@src/brainlayer/phonetic.py`:
- Around line 84-93: phonetic_match currently returns True on any token overlap
and never uses threshold; change the logic in phonetic_match (which calls
phonetic_tokens) to first handle empty sets, then compute similarity = len(set1
& set2) / len(set1 | set2) and return similarity >= threshold (removing the
unconditional if set1 & set2: return True), or alternatively remove the
threshold parameter if you intend any overlap to match; update phonetic_match
accordingly so the threshold actually controls matching.
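
One way to implement option (a) from the seed_aliases duplicate-name comment above: build a name-to-ids multimap and surface ambiguity before seeding. Row shapes here are hypothetical placeholders.

```python
from collections import defaultdict

def build_name_index(rows):
    """Map name -> [entity_id, ...] so duplicate names surface explicitly."""
    index = defaultdict(list)
    for entity_id, name, _canonical_name in rows:
        index[name].append(entity_id)
    return index

rows = [("e1", "Claude", None), ("e2", "Claude", None), ("e3", "Cursor", None)]
index = build_name_index(rows)
dupes = {name: ids for name, ids in index.items() if len(ids) > 1}
if dupes:
    # Fail early rather than silently seeding aliases onto the wrong entity.
    print(f"ambiguous entity names: {dupes}")
```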

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0a6a74d9-208a-4293-af06-f302ca232441

📥 Commits

Reviewing files that changed from the base of the PR and between 7984e35 and cc7c5cc.

📒 Files selected for processing (9)
  • hooks/brainlayer-prompt-search.py
  • pyproject.toml
  • scripts/__init__.py
  • scripts/seed_aliases.py
  • src/brainlayer/kg_repo.py
  • src/brainlayer/phonetic.py
  • src/brainlayer/pipeline/digest.py
  • src/brainlayer/vector_store.py
  • tests/test_hebrew_alias.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Macroscope - Correctness Check
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

Files:

  • scripts/__init__.py
  • src/brainlayer/vector_store.py
  • src/brainlayer/pipeline/digest.py
  • hooks/brainlayer-prompt-search.py
  • tests/test_hebrew_alias.py
  • src/brainlayer/phonetic.py
  • src/brainlayer/kg_repo.py
  • scripts/seed_aliases.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use Python/Typer CLI architecture for the main package in src/brainlayer/
All scripts and CLI must use paths.py:get_db_path() for resolving database path instead of hardcoding
Implement chunk lifecycle management with columns superseded_by, aggregated_into, archived_at; default search excludes lifecycle-managed chunks
Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
Drop FTS triggers before bulk deletes on chunks table and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches
Implement retry logic on SQLITE_BUSY errors; each worker must use its own database connection
Use ruff check src/ && ruff format src/ for linting and formatting

Files:

  • src/brainlayer/vector_store.py
  • src/brainlayer/pipeline/digest.py
  • src/brainlayer/phonetic.py
  • src/brainlayer/kg_repo.py
src/brainlayer/vector_store.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use sqlite-vec storage via APSW in vector_store.py for vector operations

Files:

  • src/brainlayer/vector_store.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use pytest for testing

Files:

  • tests/test_hebrew_alias.py
🧠 Learnings (7)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Request codex review, cursor review, and bugbot review for BrainLayer PRs
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/vector_store.py : Use sqlite-vec storage via APSW in `vector_store.py` for vector operations

Applied to files:

  • src/brainlayer/vector_store.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/{vector_store,search}*.py : Chunk lifecycle: implement columns `superseded_by`, `aggregated_into`, `archived_at` on chunks table; exclude lifecycle-managed chunks from default search

Applied to files:

  • src/brainlayer/vector_store.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/vector_store.py : Use sqlite-vec with APSW for vector storage and retrieval

Applied to files:

  • src/brainlayer/vector_store.py
📚 Learning: 2026-04-04T15:21:36.497Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 198
File: hooks/brainlayer-prompt-search.py:169-169
Timestamp: 2026-04-04T15:21:36.497Z
Learning: In EtanHey/brainlayer, `hooks/brainlayer-prompt-search.py` reads `entity_type` directly from existing rows in `kg_entities` (read-only). `contracts/entity-types.yaml` defines the write-side schema only and is not authoritative for what `entity_type` values exist in the DB. The DB already stores `technology` (72 entities), `project` (24), and `tool` (1) as valid `entity_type` values, so `INJECT_TYPES` in the hook should match these DB values, not the contract file.

Applied to files:

  • hooks/brainlayer-prompt-search.py
📚 Learning: 2026-04-04T15:21:59.853Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 198
File: hooks/brainlayer-prompt-search.py:241-259
Timestamp: 2026-04-04T15:21:59.853Z
Learning: In `hooks/brainlayer-prompt-search.py` (Python), `record_injection_event()` is explicitly best-effort telemetry: silent `except sqlite3.Error: pass` is intentional — table non-existence or lock failures are acceptable silent failures. `sqlite3.connect(timeout=2)` is the file-open timeout; `PRAGMA busy_timeout` governs per-statement lock-wait. The `DEADLINE_MS` (450ms) guard applies only to the FTS search phase, not to this side-channel write.

Applied to files:

  • hooks/brainlayer-prompt-search.py
📚 Learning: 2026-03-17T01:04:22.497Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 0
File: :0-0
Timestamp: 2026-03-17T01:04:22.497Z
Learning: Applies to src/brainlayer/mcp/**/*.py and brain-bar/Sources/BrainBar/MCPRouter.swift: The 8 required MCP tools are `brain_search`, `brain_store`, `brain_recall`, `brain_entity`, `brain_expand`, `brain_update`, `brain_digest`, `brain_tags`. `brain_tags` is the 8th tool, replacing `brain_get_person`, as defined in the Phase B spec merged in PR `#72`. The Python MCP server already implements `brain_tags`. Legacy `brainlayer_*` aliases must be maintained for backward compatibility.

Applied to files:

  • hooks/brainlayer-prompt-search.py
🔇 Additional comments (11)
scripts/__init__.py (1)

1-1: LGTM!

Standard package initialization to make scripts/ importable.

src/brainlayer/vector_store.py (1)

585-586: LGTM!

Good addition of indexes to support the new alias resolution flow:

  • COLLATE NOCASE on alias enables efficient case-insensitive lookups
  • Index on alias_type speeds up filtering for phonetic alias queries

These align with the query patterns in kg_repo.py and brainlayer-prompt-search.py.

src/brainlayer/pipeline/digest.py (1)

806-819: LGTM!

The updated resolution order is sound:

  1. Try resolve_entity() for exact/alias/phonetic match
  2. Only fall back to FTS/semantic if no match or type mismatch
  3. The short-circuit preserves downstream behavior (relations, evidence, health lookup)

This correctly prioritizes deterministic resolution before fuzzy search.

hooks/brainlayer-prompt-search.py (2)

370-372: LGTM!

Good addition of Hebrew token extraction into the candidates list for entity detection.


379-399: LGTM!

Proper defensive check for kg_entity_aliases table existence and clean fallback to alias lookup when exact name match fails.

tests/test_hebrew_alias.py (1)

1-165: LGTM!

Comprehensive test coverage for the Hebrew alias resolution feature:

  • Unit tests for phonetic utilities
  • Integration tests for alias resolution paths
  • Hook behavior validation
  • Performance regression test (5ms threshold)

Good test hygiene with fresh DB fixtures per test.

src/brainlayer/kg_repo.py (1)

702-776: LGTM on resolution order logic.

The updated resolve_entity order is sensible:

  1. Exact name (case-insensitive) — deterministic, highest priority
  2. Exact alias — handles registered aliases
  3. Canonical name — normalized lookup
  4. Phonetic alias — cross-script fallback
  5. FTS5 fuzzy — last resort

The exact name step running before aliases addresses the concern from the context snippets — if a string is both an alias for Entity A and the canonical name of Entity B, Entity B wins, which is the correct behavior for entity extraction pipelines.
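
The five-step order can be expressed as a first-hit-wins chain (an illustrative sketch; the real resolve_entity runs a SQLite query at each step):

```python
def resolve_entity(name, resolvers):
    """Try each resolver in priority order; the first non-None hit wins."""
    for resolver in resolvers:
        hit = resolver(name)
        if hit is not None:
            return hit
    return None

# Hypothetical lookup tables standing in for the SQL-backed steps:
# exact name, then alias (canonical, phonetic, and FTS would follow).
exact = {"claude": "entity-1"}.get
alias = {"klod": "entity-1"}.get
chain = [lambda n: exact(n.lower()), lambda n: alias(n.lower())]

print(resolve_entity("Claude", chain))  # → entity-1
print(resolve_entity("zzz", chain))     # → None
```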

src/brainlayer/phonetic.py (2)

12-47: LGTM!

Well-structured Hebrew transliteration mappings:

  • Character map covers standard Hebrew alphabet including final forms
  • Sequence map handles common digraphs first (e.g., "היי" → "hei")
  • Order matters: sequence replacement before character replacement

10-10: Parameters are valid and use documented defaults.

name_mode="gen" and match_mode="approx" are both valid and are the default values for BeiderMorse according to the abydos library documentation. No changes needed.

pyproject.toml (1)

39-39: Dependency constraint is valid.

The abydos>=0.5.0 constraint is available on PyPI and references the current latest version (0.5.0). The dependency is appropriately documented with its purpose.

scripts/seed_aliases.py (1)

55-56: Good resource cleanup.

Using finally to close the store is solid and prevents leaked DB handles on exceptions.

Comment on lines +400 to +427
if not rows and alias_table_exists and looks_hebrew(candidate):
    query_tokens = phonetic_tokens(candidate)
    if query_tokens:
        phonetic_rows = conn.execute(
            """
            SELECT e.id, e.name, e.entity_type, a.alias
            FROM kg_entity_aliases a
            JOIN kg_entities e ON a.entity_id = e.id
            WHERE a.alias_type = 'phonetic'
            """
        ).fetchall()
        query_key = phonetic_key(candidate)
        best_row = None
        best_score = 0.0
        for row in phonetic_rows:
            alias_tokens = {token for token in str(row[3]).split() if token}
            if not alias_tokens:
                continue
            overlap = query_tokens & alias_tokens
            if not overlap:
                continue
            score = len(overlap) / len(query_tokens | alias_tokens)
            if row[3] == query_key:
                score = 1.0
            if score > best_score:
                best_score = score
                best_row = row
        rows = [best_row[:3]] if best_row else []

🧹 Nitpick | 🔵 Trivial

Performance concern: full table scan for phonetic alias matching.

For each Hebrew candidate, this fetches all phonetic alias rows from kg_entity_aliases and scores them in Python. With many entities, this becomes O(n) per candidate.

Consider adding a SQL-side filter or limiting the fetch:

  1. Pre-filter by partial token match using LIKE on the alias column
  2. Add a LIMIT clause to bound worst-case
  3. Cache the phonetic alias table if it's small and stable
♻️ Proposed optimization sketch
-                    phonetic_rows = conn.execute(
-                        """
-                        SELECT e.id, e.name, e.entity_type, a.alias
-                        FROM kg_entity_aliases a
-                        JOIN kg_entities e ON a.entity_id = e.id
-                        WHERE a.alias_type = 'phonetic'
-                        """
-                    ).fetchall()
+                    # Use first token for SQL pre-filter to reduce scan size
+                    first_token = next(iter(query_tokens), "")
+                    phonetic_rows = conn.execute(
+                        """
+                        SELECT e.id, e.name, e.entity_type, a.alias
+                        FROM kg_entity_aliases a
+                        JOIN kg_entities e ON a.entity_id = e.id
+                        WHERE a.alias_type = 'phonetic'
+                          AND (a.alias LIKE '%' || ? || '%' OR ? = '')
+                        LIMIT 100
+                        """,
+                        (first_token, first_token),
+                    ).fetchall()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hooks/brainlayer-prompt-search.py` around lines 400 - 427, The current
phonetic matching does a full table scan by calling conn.execute(...) to load
phonetic_rows and scoring in Python; change this to push token filtering and
bounding into SQL and/or cache the table: compute query_tokens via
phonetic_tokens(candidate) as you already do, then build a parameterized SQL
WHERE clause that pre-filters aliases using alias LIKE '%token%' OR alias LIKE
'%token2%' for the tokens and add a LIMIT (e.g., LIMIT 100) to the query
executed via conn.execute to reduce rows returned (keep the same selected
columns e.id, e.name, e.entity_type, a.alias), and optionally implement a small
in-memory cache keyed by phonetic_key() to reuse the phonetic alias set (update
whatever cache-get/refresh helper you add in this module) so that phonetic_rows
is fetched from cache instead of re-querying on every call; keep the downstream
scoring logic (best_row/best_score) the same but operate only on the
pre-filtered/limited result set.

Comment thread scripts/seed_aliases.py
Comment on lines +17 to +27
def _iter_entities(store: VectorStore) -> list[tuple[str, str, str]]:
    """Return entity id, name, and canonical name for all KG entities."""
    return list(
        store._read_cursor().execute(
            """
            SELECT id, name, canonical_name
            FROM kg_entities
            ORDER BY name
            """
        )
    )

⚠️ Potential issue | 🟡 Minor

Fix _iter_entities return typing for nullable canonical_name.

Line 47 already handles canonical_name as nullable, but Line 17 annotates it as always str. Please make the tuple type reflect None possibility.

Proposed fix
-def _iter_entities(store: VectorStore) -> list[tuple[str, str, str]]:
+def _iter_entities(store: VectorStore) -> list[tuple[str, str, str | None]]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/seed_aliases.py` around lines 17 - 27, The _iter_entities function
annotation incorrectly claims canonical_name is always a str; update its return
type to allow None (e.g., change list[tuple[str, str, str]] to list[tuple[str,
str, Optional[str]]]) and add/import Optional from typing, keeping the function
body unchanged (this reflects that kg_entities.canonical_name can be NULL and
matches downstream handling of canonical_name).

Comment thread scripts/seed_aliases.py
Comment on lines +30 to +55
def seed_aliases(db_path: str | Path) -> int:
    """Populate known aliases and BMPM phonetic aliases. Returns rows inserted."""
    store = VectorStore(Path(db_path))
    inserted = 0
    try:
        entities_by_name = {name: entity_id for entity_id, name, _canonical_name in _iter_entities(store)}

        for alias, entity_name, alias_type in KNOWN_ALIASES:
            entity_id = entities_by_name.get(entity_name)
            if entity_id is None:
                continue
            before = len(store.get_entity_aliases(entity_id))
            store.add_entity_alias(alias, entity_id, alias_type=alias_type)
            after = len(store.get_entity_aliases(entity_id))
            inserted += max(0, after - before)

        for entity_id, name, canonical_name in _iter_entities(store):
            for candidate in {name, canonical_name or ""}:
                key = phonetic_key(candidate)
                if not key:
                    continue
                before = len(store.get_entity_aliases(entity_id))
                store.add_entity_alias(key, entity_id, alias_type="phonetic")
                after = len(store.get_entity_aliases(entity_id))
                inserted += max(0, after - before)
    finally:

⚠️ Potential issue | 🔴 Critical

Serialize this entire seeding path; writes are currently unguarded.

Lines 42 and 52 perform write operations in loops, but this script does not enforce a one-writer-at-a-time guard. That risks contention with other MCP write-heavy flows (e.g., digest) and violates the repo’s write-serialization rule.

As per coding guidelines, "Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work."
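One way to satisfy the one-writer rule for the whole seeding pass is to take the write lock up front with BEGIN IMMEDIATE. This is a sketch, not the repo's actual lock helper, and it assumes an autocommit sqlite3 connection (isolation_level=None):

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def exclusive_writer(conn: sqlite3.Connection, timeout_s: float = 30.0):
    """Serialize writes: BEGIN IMMEDIATE acquires the write lock immediately,
    so a concurrent writer (e.g. a digest run) blocks up to busy_timeout or
    fails fast instead of interleaving with this seeding pass."""
    conn.execute(f"PRAGMA busy_timeout = {int(timeout_s * 1000)}")
    conn.execute("BEGIN IMMEDIATE")
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
```

The whole seed_aliases body would then run inside a single `with exclusive_writer(conn):` block, so either every alias lands or none do.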

Comment thread scripts/seed_aliases.py
Comment on lines +35 to +40
        entities_by_name = {name: entity_id for entity_id, name, _canonical_name in _iter_entities(store)}

        for alias, entity_name, alias_type in KNOWN_ALIASES:
            entity_id = entities_by_name.get(entity_name)
            if entity_id is None:
                continue

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Scripts executed (verification commands run against EtanHey/brainlayer; per-script output omitted):

  • rg -nP 'CREATE TABLE\s+kg_entities|UNIQUE\s*\(\s*name\s*\)|name\s+\w+\s+UNIQUE' -C4 — check whether kg_entities.name is uniquely constrained in schema/migrations
  • rg -nP 'INSERT\s+INTO\s+kg_entities|add_entity\s*\(' -C3 --type=py — inspect insert/write paths to confirm whether duplicate names can be created
  • fd -type f -name "*.py" | grep -E "(migration|schema|init.*db)" | head -20
  • find . -type f -name "*.sql" | head -20
  • cat -n scripts/seed_aliases.py | head -50 and cat -n scripts/seed_aliases.py | sed -n '30,60p'
  • rg -n "kg_entities" src/brainlayer/vector_store.py -A 20 | grep -E "CREATE TABLE|UNIQUE|CONSTRAINT" | head -20
  • cat -n src/brainlayer/vector_store.py | sed -n '491,510p'
  • rg -n "entity_type" scripts/seed_aliases.py -B 5 -A 5 and rg -n "INJECT_TYPES|entity_type" hooks/brainlayer-prompt-search.py -B 2 -A 2 | head -40
  • rg -n "duplicate.*name|name.*duplicate" --type py and cat -n scripts/dedup_entities.py | head -50
  • rg -n "same.*name.*different.*type|entity_type.*name" -B 3 -A 3 scripts/seed_aliases.py
  • rg -n "Etan Heyman" --type py -B 2 -A 2 — check whether the KNOWN_ALIASES entities exist and whether duplicate names are plausible
  • rg -n "seed_aliases|seed.*alias" tests/ --type py -B 2 -A 5 | head -50 — look for tests covering duplicate names

Verify name → entity_id mapping for duplicate names across entity types.

Line 35 creates a dict that silently overwrites if kg_entities.name appears more than once. The schema constraint is UNIQUE(entity_type, name) (not on name alone), so duplicate names are possible across different entity types: for example, "Etan Heyman" could exist as both person and project. If such duplicates exist in the database, the dict comprehension keeps only the last entity_id seen, and subsequent alias seeds could target the wrong entity.

Guard against this by either:

  1. Checking that duplicate names do not exist, or
  2. Filtering/validating entity_type before seeding aliases, or
  3. Updating the schema to constrain name uniqueness globally (if appropriate for your model).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/seed_aliases.py` around lines 35 - 40, The dict comprehension
building entities_by_name from _iter_entities silently drops duplicate names
(since schema is UNIQUE(entity_type, name) not name), so detect and handle
duplicates: build a map name -> list(entity_id) from _iter_entities, then if any
name maps to multiple entity_ids either (a) raise an explicit error with the
conflicting name(s) and entity_ids so the seed run fails early, or (b) require
KNOWN_ALIASES entries to specify entity_type and filter the list by that type
when resolving entity_id; update the resolution logic that currently uses
entities_by_name.get(entity_name) to consult the list and disambiguate by
entity_type (or fail) to avoid seeding aliases to the wrong entity.
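A sketch of option (a), failing fast on ambiguous names; the function name and tuple shape are hypothetical stand-ins for the script's helpers:

```python
from collections import defaultdict
from typing import Optional

def build_name_index(entities: list[tuple[str, str, Optional[str]]]) -> dict[str, str]:
    """Map name -> entity_id, raising if any name resolves to more than one
    entity (possible because uniqueness is on (entity_type, name), not name)."""
    by_name: dict[str, list[str]] = defaultdict(list)
    for entity_id, name, _canonical in entities:
        by_name[name].append(entity_id)
    dupes = {n: ids for n, ids in by_name.items() if len(ids) > 1}
    if dupes:
        raise ValueError(f"ambiguous entity names, refusing to seed aliases: {dupes}")
    return {name: ids[0] for name, ids in by_name.items()}
```

Option (b) would instead carry entity_type in each KNOWN_ALIASES entry and filter the candidate list by it before picking an entity_id.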

Comment thread scripts/seed_aliases.py
Comment on lines +41 to +54
            before = len(store.get_entity_aliases(entity_id))
            store.add_entity_alias(alias, entity_id, alias_type=alias_type)
            after = len(store.get_entity_aliases(entity_id))
            inserted += max(0, after - before)

        for entity_id, name, canonical_name in _iter_entities(store):
            for candidate in {name, canonical_name or ""}:
                key = phonetic_key(candidate)
                if not key:
                    continue
                before = len(store.get_entity_aliases(entity_id))
                store.add_entity_alias(key, entity_id, alias_type="phonetic")
                after = len(store.get_entity_aliases(entity_id))
                inserted += max(0, after - before)

⚠️ Potential issue | 🟠 Major

Avoid get_entity_aliases() before/after every insert.

This does two full alias reads per attempted insert, which is expensive and can miscount inserted under concurrent writes. Prefer an insert result signal (e.g., rowcount/changes) or a preloaded alias set updated in-memory during this run.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/seed_aliases.py` around lines 41 - 54, The current loop uses
get_entity_aliases() before/after each add_entity_alias call (in the block that
also uses phonetic_key and _iter_entities), causing two full reads per insert
and racey inserted counts; instead, change the logic in the loops that call
get_entity_aliases(), add_entity_alias and phonetic_key to either (A) rely on
the DB/client insert result (e.g., rowcount/changes returned by
add_entity_alias) to increment inserted, or (B) preload each entity's alias set
once into memory before inserts and update that in-memory set when
add_entity_alias succeeds so you only check membership in-memory (and increment
inserted when you actually add a new alias); update references to
get_entity_aliases, add_entity_alias, phonetic_key, _iter_entities and the
inserted counter accordingly.
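Option (A) can lean on the cursor's rowcount if the insert is idempotent. A sketch, assuming a hypothetical UNIQUE(alias, entity_id) constraint on kg_entity_aliases:

```python
import sqlite3

def add_alias_counted(conn: sqlite3.Connection, alias: str, entity_id: str,
                      alias_type: str) -> int:
    """Insert an alias and report whether a row was actually added, using
    cursor.rowcount instead of read-before/read-after alias scans.
    INSERT OR IGNORE leaves rowcount at 0 when the alias already exists."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO kg_entity_aliases (alias, entity_id, alias_type) "
        "VALUES (?, ?, ?)",
        (alias, entity_id, alias_type),
    )
    return cur.rowcount

# Caller pattern: inserted += add_alias_counted(conn, key, entity_id, "phonetic")
```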

Comment thread src/brainlayer/kg_repo.py
Comment on lines +504 to +564
    def get_entity_by_phonetic_alias(
        self,
        name: str,
        entity_type: Optional[str] = None,
    ) -> Optional[Dict[str, Any]]:
        """Look up an entity by stored BMPM phonetic aliases."""
        query_key = phonetic_key(name)
        query_tokens = phonetic_tokens(name)
        if not query_key or not query_tokens:
            return None

        cursor = self._read_cursor()
        if entity_type:
            rows = list(
                cursor.execute(
                    """
                    SELECT e.id, e.entity_type, e.name, e.metadata, e.created_at, e.updated_at, a.alias
                    FROM kg_entity_aliases a
                    JOIN kg_entities e ON a.entity_id = e.id
                    WHERE a.alias_type = 'phonetic' AND e.entity_type = ?
                    """,
                    (entity_type,),
                )
            )
        else:
            rows = list(
                cursor.execute(
                    """
                    SELECT e.id, e.entity_type, e.name, e.metadata, e.created_at, e.updated_at, a.alias
                    FROM kg_entity_aliases a
                    JOIN kg_entities e ON a.entity_id = e.id
                    WHERE a.alias_type = 'phonetic'
                    """
                )
            )

        best_match = None
        best_score = 0.0
        for row in rows:
            alias_tokens = {token for token in str(row[6]).split() if token}
            if not alias_tokens:
                continue
            overlap = query_tokens & alias_tokens
            if not overlap:
                continue
            score = len(overlap) / len(query_tokens | alias_tokens)
            if row[6] == query_key:
                score = 1.0
            if score <= best_score:
                continue
            best_score = score
            best_match = {
                "id": row[0],
                "entity_type": row[1],
                "name": row[2],
                "metadata": json.loads(row[3]) if row[3] else {},
                "created_at": row[4],
                "updated_at": row[5],
            }

        return best_match

🧹 Nitpick | 🔵 Trivial

Performance concern: full table scan for phonetic alias matching.

get_entity_by_phonetic_alias fetches all rows where alias_type='phonetic' and scores them in Python. This duplicates the same pattern in the prompt-search hook and has the same O(n) concern.

Consider:

  1. Adding SQL-side filtering by a token prefix
  2. Adding a LIMIT clause
  3. Extracting shared scoring logic into a helper to avoid divergence
♻️ Proposed optimization sketch
+        # Pre-filter by first token to reduce scan size
+        first_token = next(iter(query_tokens), "")
         if entity_type:
             rows = list(
                 cursor.execute(
                     """
                     SELECT e.id, e.entity_type, e.name, e.metadata, e.created_at, e.updated_at, a.alias
                     FROM kg_entity_aliases a
                     JOIN kg_entities e ON a.entity_id = e.id
-                    WHERE a.alias_type = 'phonetic' AND e.entity_type = ?
+                    WHERE a.alias_type = 'phonetic' AND e.entity_type = ?
+                      AND (a.alias LIKE '%' || ? || '%' OR ? = '')
+                    LIMIT 200
                     """,
-                    (entity_type,),
+                    (entity_type, first_token, first_token),
                 )
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/kg_repo.py` around lines 504 - 564,
get_entity_by_phonetic_alias is doing a full table scan of
kg_entity_aliases/kg_entities and scoring in Python which is O(n); change it to
push cheap filtering into SQL (use phonetic_key/phonetic_tokens to take the
first token or token prefixes and add a WHERE clause like a.alias LIKE ? or
token_prefix match and include a LIMIT to bound rows), and extract the
overlap/score computation into a shared helper (e.g.,
phonetic_score(query_tokens, alias_string)) used by get_entity_by_phonetic_alias
and the prompt-search hook so scoring logic is centralized and consistent;
ensure you still treat exact key matches (phonetic_key) as score=1.0 and decode
metadata with json.loads as before.
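The shared scoring helper mentioned in point 3 could look like this (the name phonetic_score is a suggestion, not an existing function; it mirrors the Jaccard-plus-exact-key logic in the method above):

```python
def phonetic_score(query_tokens: set[str], query_key: str, alias: str) -> float:
    """Score a stored phonetic alias against a query: Jaccard overlap of
    BMPM tokens, with an exact full-key match short-circuited to 1.0."""
    if alias == query_key:
        return 1.0
    alias_tokens = {t for t in alias.split() if t}
    if not alias_tokens or not query_tokens:
        return 0.0
    overlap = query_tokens & alias_tokens
    if not overlap:
        return 0.0
    return len(overlap) / len(query_tokens | alias_tokens)
```

Both get_entity_by_phonetic_alias and the prompt-search hook could then rank candidates with the same function, keeping the two code paths from drifting apart.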

Comment on lines +84 to +93
def phonetic_match(name1: str, name2: str, threshold: float = 0.7) -> bool:
    """Check if two names match phonetically."""
    set1 = phonetic_tokens(name1)
    set2 = phonetic_tokens(name2)
    if not set1 or not set2:
        return False
    if set1 & set2:
        return True
    similarity = len(set1 & set2) / len(set1 | set2)
    return similarity >= threshold

⚠️ Potential issue | 🟡 Minor

Logic bug: threshold parameter is ineffective.

The function has dead code:

  • Line 90-91: Returns True if any token overlap exists
  • Line 92: If we reach this line, set1 & set2 is empty, so similarity is always 0
  • Line 93: Always returns False since 0 >= threshold is false

The threshold parameter is never actually used for scoring.

🐛 Proposed fix

If the intent is to require a minimum similarity score:

 def phonetic_match(name1: str, name2: str, threshold: float = 0.7) -> bool:
     """Check if two names match phonetically."""
     set1 = phonetic_tokens(name1)
     set2 = phonetic_tokens(name2)
     if not set1 or not set2:
         return False
-    if set1 & set2:
-        return True
     similarity = len(set1 & set2) / len(set1 | set2)
     return similarity >= threshold

Or if any overlap should always match, remove the misleading parameter:

-def phonetic_match(name1: str, name2: str, threshold: float = 0.7) -> bool:
+def phonetic_match(name1: str, name2: str) -> bool:
     """Check if two names match phonetically."""
     set1 = phonetic_tokens(name1)
     set2 = phonetic_tokens(name2)
     if not set1 or not set2:
         return False
-    if set1 & set2:
-        return True
-    similarity = len(set1 & set2) / len(set1 | set2)
-    return similarity >= threshold
+    return bool(set1 & set2)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/phonetic.py` around lines 84 - 93, phonetic_match currently
returns True on any token overlap and never uses threshold; change the logic in
phonetic_match (which calls phonetic_tokens) to first handle empty sets, then
compute similarity = len(set1 & set2) / len(set1 | set2) and return similarity
>= threshold (removing the unconditional if set1 & set2: return True), or
alternatively remove the threshold parameter if you intend any overlap to match;
update phonetic_match accordingly so the threshold actually controls matching.
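To see why the first fix makes threshold meaningful, here is the Jaccard scoring in isolation; the token sets are stubbed stand-ins for real BMPM output:

```python
def jaccard(set1: set[str], set2: set[str]) -> float:
    """Similarity used by the proposed fix: intersection over union."""
    if not set1 or not set2:
        return 0.0
    return len(set1 & set2) / len(set1 | set2)

# With the unconditional `if set1 & set2: return True` removed, a single
# shared token no longer guarantees a match; it must clear the threshold.
partial = jaccard({"itn", "haimn"}, {"haimn", "kohn"})  # 1/3, fails 0.7
exact = jaccard({"itn", "haimn"}, {"itn", "haimn"})     # 1.0, passes
```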

@EtanHey EtanHey force-pushed the feat/p2b-hebrew-alias branch from cc7c5cc to 33512e8 Compare April 4, 2026 23:58
@EtanHey EtanHey merged commit df9dd88 into main Apr 5, 2026
5 of 6 checks passed