feat: co-occurrence relation extraction for KG edges #178
Conversation
Root cause: brain_digest found entities but never created edges because relation extraction only ran via the LLM path (which requires an llm_caller). No rule-based relation extractor existed. Fix: Added extract_cooccurrence_relations() — infers "co_occurs_with" relations between entities of different types in the same text. Runs without LLM, integrated into extract_entities_combined() as step 4. Confidence = min(entity_a, entity_b) * 0.7 (conservative). Same-type pairs skipped (too noisy). Deduplicated. 5 new tests covering: pair creation, single entity, dedup, same-type exclusion, integration with combined extraction. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
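The mechanism described above can be sketched as a minimal standalone implementation. The `Entity` and `Relation` dataclasses below are hypothetical stand-ins for the PR's `ExtractedEntity`/`ExtractedRelation` (field names are assumptions based on the description), not the actual project code:

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Entity:
    # Hypothetical stand-in for the PR's ExtractedEntity
    text: str
    entity_type: str
    confidence: float

@dataclass
class Relation:
    # Hypothetical stand-in for the PR's ExtractedRelation
    source_text: str
    target_text: str
    relation_type: str
    confidence: float

def cooccurrence_relations(entities: list[Entity]) -> list[Relation]:
    """Infer co_occurs_with edges between entities of different types."""
    relations: list[Relation] = []
    seen: set[tuple[str, ...]] = set()
    for a, b in combinations(entities, 2):
        if a.entity_type == b.entity_type:
            continue  # same-type pairs skipped as too noisy
        pair = tuple(sorted((a.text, b.text)))
        if pair in seen:
            continue  # deduplicate repeated mentions of the same pair
        seen.add(pair)
        relations.append(Relation(a.text, b.text, "co_occurs_with",
                                  min(a.confidence, b.confidence) * 0.7))
    return relations
```

A project/technology pair like ("BrainLayer", 0.9) and ("SQLite", 0.8) yields one `co_occurs_with` relation with confidence `min(0.9, 0.8) * 0.7`, while two same-type entities yield none.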
@coderabbitai review

✅ Actions performed: Review triggered.
📝 Walkthrough

The changes add a new rule-based co-occurrence relation extractor, integrated into the combined entity-extraction path, along with tests for it.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/brainlayer/pipeline/entity_extraction.py`:
- Line 517: The current pair key uses raw casing (pair = (a.text, b.text) if
a.text < b.text else (b.text, a.text)), which causes inconsistent deduplication
for the same entity with different case; change the comparison to use a
case-normalized form (use Unicode-aware casefold or lowercase) for ordering
(e.g., compare a_key = a.text.casefold() and b_key = b.text.casefold()), but
keep the original a.text/b.text values for the stored pair if you need to
preserve original casing; update the code that constructs pair in
entity_extraction.py to order by the normalized keys instead of the raw texts.
- Around line 513-520: The nested pairwise loop in entity_extraction.py (the for
i, a in enumerate(entities): for b in entities[i + 1 :] loop that builds
relations and uses seen) can blow up to O(n^2); before generating pairs filter
and/or cap candidates (e.g., filter entities by a confidence threshold field
like entity.confidence, or limit to top-K entities by confidence), then produce
at most max_relations_per_chunk relations (stop generating once count reaches
the cap). Implement this in the code around the loop that builds pair and seen:
first create a filtered_entities list (apply threshold or take top-K by
confidence), then run the pairwise loop on that list and break early when the
relation count reaches max_relations_per_chunk; make max_relations_per_chunk and
confidence threshold configurable constants or function parameters.
In `@tests/test_kg_relations.py`:
- Around line 10-52: Add two edge-case tests to TestCooccurrenceRelations:
implement test_empty_entities_returns_empty which calls
extract_cooccurrence_relations([]) and asserts an empty list (relations == []),
and implement test_three_different_types_produce_three_relations which creates
three ExtractedEntity instances of distinct entity_type values and asserts
len(extract_cooccurrence_relations(entities)) == 3 to verify the N*(N-1)/2
co-occurrence behavior; reference the existing ExtractedEntity and
extract_cooccurrence_relations symbols when adding these tests.
- Around line 35-43: The test test_no_duplicate_relations doesn't exercise
deduplication because it supplies only unique entities; update it to include
duplicate entity instances (e.g., two ExtractedEntity objects with the same
text/entity_type/start/end or same text/source) so
extract_cooccurrence_relations receives repeated entity pairs, then assert that
only one relation is produced for that duplicate pair (check length and that
pairs == set(pairs) or that count of that specific pair == 1). Locate this logic
in the test_no_duplicate_relations function and modify the entities list (using
ExtractedEntity) and the assertions accordingly so the deduplication behavior of
extract_cooccurrence_relations is actually validated.
- Around line 13-25: Update the test_two_entities_produce_relation test to
assert the exact confidence value produced by extract_cooccurrence_relations:
compute expected_conf = min(0.9, 0.8) * 0.7 and replace the loose check (0 <
rel.confidence <= 1.0) with an exact assertion rel.confidence == expected_conf;
ensure you reference the ExtractedEntity instances used in the test and the
relation object rel returned by extract_cooccurrence_relations and also add
import pytest at the top of the test file if the test runner or assertion style
requires it.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d100aa5a-160f-4177-b456-d17123f4de51
📒 Files selected for processing (2)
src/brainlayer/pipeline/entity_extraction.py
tests/test_kg_relations.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: test (3.11)
- GitHub Check: test (3.13)
- GitHub Check: test (3.12)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests
Files:
tests/test_kg_relations.py, src/brainlayer/pipeline/entity_extraction.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use `pytest` for testing
Files:
tests/test_kg_relations.py
src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/brainlayer/**/*.py: Use Python/Typer CLI architecture for the main package in `src/brainlayer/`
All scripts and CLI must use `paths.py:get_db_path()` for resolving database path instead of hardcoding
Implement chunk lifecycle management with columns `superseded_by`, `aggregated_into`, `archived_at`; default search excludes lifecycle-managed chunks
Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
Drop FTS triggers before bulk deletes on `chunks` table and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches
Implement retry logic on `SQLITE_BUSY` errors; each worker must use its own database connection
Use `ruff check src/ && ruff format src/` for linting and formatting
Files:
src/brainlayer/pipeline/entity_extraction.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Request codex review, cursor review, and bugbot review for BrainLayer PRs
🔇 Additional comments (3)
src/brainlayer/pipeline/entity_extraction.py (2)
500-531: `co_occurs_with` will be normalized to `related_to` downstream.

Per kg_extraction.py lines 99-102, any `relation_type` not in `CANONICAL_RELATION_TYPES` is normalized to `"related_to"`. The set at lines 31-41 does not include `"co_occurs_with"`. If the semantic distinction matters (e.g., for filtering or weighting co-occurrence edges differently), add `"co_occurs_with"` to `CANONICAL_RELATION_TYPES`. If all you need is some edge, the current behavior is fine; just note that `co_occurs_with` becomes `related_to` after validation. [raise_major_issue, request_verification]

```bash
#!/bin/bash
# Verify CANONICAL_RELATION_TYPES contents
rg -n "CANONICAL_RELATION_TYPES" src/brainlayer/pipeline/kg_extraction.py -A 12
```
580-582: Integration looks correct; co-occurrence runs on deduplicated entities.

Running after entity deduplication ensures relations reference the canonical entity text. The relations are appended to `all_relations`, preserving any LLM-extracted relations when `use_llm=True`.

tests/test_kg_relations.py (1)

58-69: Integration test validates the core PR objective.

This test confirms that `extract_entities_combined` produces `co_occurs_with` relations without an LLM, which is the main fix described in the PR objectives.
```python
for i, a in enumerate(entities):
    for b in entities[i + 1 :]:
        if a.entity_type == b.entity_type:
            continue
        pair = (a.text, b.text) if a.text < b.text else (b.text, a.text)
        if pair in seen:
            continue
        seen.add(pair)
```
🧹 Nitpick | 🔵 Trivial
Potential quadratic blowup for large entity lists.
For n entities, this produces up to n*(n-1)/2 relations. With 50+ entities in a chunk, you get 1000+ relations—most of which may be noise.
Consider capping output (e.g., top-k by confidence) or limiting to entities above a confidence threshold. This is a minor concern for typical chunk sizes but could degrade performance on outlier chunks.
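The growth the reviewer flags can be quantified directly: n entities produce at most n*(n-1)/2 candidate pairs. A quick standalone check (not project code):

```python
from math import comb

# Upper bound on candidate co-occurrence pairs per chunk of n entities;
# same-type filtering and deduplication can only reduce this count.
pair_counts = {n: comb(n, 2) for n in (5, 10, 50, 100)}
```

At 50 entities the "1000+" figure above corresponds to `comb(50, 2) == 1225` candidate pairs, which is what motivates a cap or top-K filter.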
🛡️ Optional: Add a cap on relation count

```diff
+MAX_COOCCURRENCE_RELATIONS = 50  # Cap to avoid noise explosion
+
 def extract_cooccurrence_relations(entities: list[ExtractedEntity]) -> list[ExtractedRelation]:
     ...
     relations: list[ExtractedRelation] = []
     seen: set[tuple[str, str]] = set()
     for i, a in enumerate(entities):
         for b in entities[i + 1 :]:
+            if len(relations) >= MAX_COOCCURRENCE_RELATIONS:
+                break
             if a.entity_type == b.entity_type:
                 continue
             ...
+        if len(relations) >= MAX_COOCCURRENCE_RELATIONS:
+            break
     return relations
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/pipeline/entity_extraction.py` around lines 513 - 520, The
nested pairwise loop in entity_extraction.py (the for i, a in
enumerate(entities): for b in entities[i + 1 :] loop that builds relations and
uses seen) can blow up to O(n^2); before generating pairs filter and/or cap
candidates (e.g., filter entities by a confidence threshold field like
entity.confidence, or limit to top-K entities by confidence), then produce at
most max_relations_per_chunk relations (stop generating once count reaches the
cap). Implement this in the code around the loop that builds pair and seen:
first create a filtered_entities list (apply threshold or take top-K by
confidence), then run the pairwise loop on that list and break early when the
relation count reaches max_relations_per_chunk; make max_relations_per_chunk and
confidence threshold configurable constants or function parameters.
```python
for b in entities[i + 1 :]:
    if a.entity_type == b.entity_type:
        continue
    pair = (a.text, b.text) if a.text < b.text else (b.text, a.text)
```
🧹 Nitpick | 🔵 Trivial
Lexicographic ordering uses raw text casing—may cause inconsistent deduplication.
a.text < b.text compares by Unicode codepoint, so "Apple" vs "apple" yields different orderings. If the same entity appears with different casing from different sources (e.g., "SQLite" vs "sqlite"), they won't dedupe correctly.
Consider normalizing to lowercase for the pair key:
♻️ Proposed fix
```diff
-        pair = (a.text, b.text) if a.text < b.text else (b.text, a.text)
+        key_a, key_b = a.text.lower(), b.text.lower()
+        pair = (key_a, key_b) if key_a < key_b else (key_b, key_a)
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
key_a, key_b = a.text.lower(), b.text.lower()
pair = (key_a, key_b) if key_a < key_b else (key_b, key_a)
```
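A standalone illustration (not the project's code) of why a case-normalized key deduplicates mixed-case mentions while the current raw ordering does not:

```python
def pair_key(a: str, b: str) -> tuple[str, str]:
    # Order by casefolded text so mixed-case mentions map to one key
    ka, kb = a.casefold(), b.casefold()
    return (ka, kb) if ka < kb else (kb, ka)

def raw_key(a: str, b: str) -> tuple[str, str]:
    # Current behavior: case-sensitive codepoint ordering
    return (a, b) if a < b else (b, a)
```

With this, `pair_key("SQLite", "BrainLayer")` and `pair_key("sqlite", "brainlayer")` are the same key, whereas `raw_key` produces two distinct keys for the same logical pair.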
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/pipeline/entity_extraction.py` at line 517, The current pair
key uses raw casing (pair = (a.text, b.text) if a.text < b.text else (b.text,
a.text)), which causes inconsistent deduplication for the same entity with
different case; change the comparison to use a case-normalized form (use
Unicode-aware casefold or lowercase) for ordering (e.g., compare a_key =
a.text.casefold() and b_key = b.text.casefold()), but keep the original
a.text/b.text values for the stored pair if you need to preserve original
casing; update the code that constructs pair in entity_extraction.py to order by
the normalized keys instead of the raw texts.
```python
class TestCooccurrenceRelations:
    """Rule-based relation extraction from co-occurring entities."""

    def test_two_entities_produce_relation(self):
        """Two entities in the same text should produce a co-occurrence relation."""
        entities = [
            ExtractedEntity(text="BrainLayer", entity_type="project", start=0, end=10, confidence=0.9, source="seed"),
            ExtractedEntity(text="SQLite", entity_type="technology", start=20, end=26, confidence=0.8, source="seed"),
        ]
        relations = extract_cooccurrence_relations(entities)
        assert len(relations) >= 1
        rel = relations[0]
        assert rel.source_text == "BrainLayer"
        assert rel.target_text == "SQLite"
        assert rel.relation_type == "co_occurs_with"
        assert 0 < rel.confidence <= 1.0

    def test_no_relations_for_single_entity(self):
        """A single entity can't have co-occurrence relations."""
        entities = [
            ExtractedEntity(text="BrainLayer", entity_type="project", start=0, end=10, confidence=0.9, source="seed"),
        ]
        relations = extract_cooccurrence_relations(entities)
        assert len(relations) == 0

    def test_no_duplicate_relations(self):
        """Same entity pair should produce at most one relation."""
        entities = [
            ExtractedEntity(text="Foo", entity_type="project", start=0, end=3, confidence=0.9, source="seed"),
            ExtractedEntity(text="Bar", entity_type="technology", start=10, end=13, confidence=0.8, source="seed"),
        ]
        relations = extract_cooccurrence_relations(entities)
        pairs = [(r.source_text, r.target_text) for r in relations]
        assert len(pairs) == len(set(pairs))

    def test_same_type_entities_not_related(self):
        """Entities of the same type shouldn't get co-occurrence relations (too noisy)."""
        entities = [
            ExtractedEntity(text="Foo", entity_type="project", start=0, end=3, confidence=0.9, source="seed"),
            ExtractedEntity(text="Bar", entity_type="project", start=10, end=13, confidence=0.8, source="seed"),
        ]
        relations = extract_cooccurrence_relations(entities)
        assert len(relations) == 0
```
🧹 Nitpick | 🔵 Trivial
Consider adding edge case tests.
Missing coverage for:
- Empty entity list (should return empty list)
- Three entities of different types (should produce 3 relations)
💚 Additional test cases
```python
def test_empty_entities_returns_empty(self):
    """Empty input should return empty relations."""
    relations = extract_cooccurrence_relations([])
    assert relations == []

def test_three_different_types_produce_three_relations(self):
    """N entities of different types produce N*(N-1)/2 relations."""
    entities = [
        ExtractedEntity(text="A", entity_type="project", start=0, end=1, confidence=0.9, source="seed"),
        ExtractedEntity(text="B", entity_type="technology", start=5, end=6, confidence=0.8, source="seed"),
        ExtractedEntity(text="C", entity_type="person", start=10, end=11, confidence=0.7, source="seed"),
    ]
    relations = extract_cooccurrence_relations(entities)
    assert len(relations) == 3  # A-B, A-C, B-C
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_kg_relations.py` around lines 10 - 52, Add two edge-case tests to
TestCooccurrenceRelations: implement test_empty_entities_returns_empty which
calls extract_cooccurrence_relations([]) and asserts an empty list (relations ==
[]), and implement test_three_different_types_produce_three_relations which
creates three ExtractedEntity instances of distinct entity_type values and
asserts len(extract_cooccurrence_relations(entities)) == 3 to verify the
N*(N-1)/2 co-occurrence behavior; reference the existing ExtractedEntity and
extract_cooccurrence_relations symbols when adding these tests.
```python
def test_two_entities_produce_relation(self):
    """Two entities in the same text should produce a co-occurrence relation."""
    entities = [
        ExtractedEntity(text="BrainLayer", entity_type="project", start=0, end=10, confidence=0.9, source="seed"),
        ExtractedEntity(text="SQLite", entity_type="technology", start=20, end=26, confidence=0.8, source="seed"),
    ]
    relations = extract_cooccurrence_relations(entities)
    assert len(relations) >= 1
    rel = relations[0]
    assert rel.source_text == "BrainLayer"
    assert rel.target_text == "SQLite"
    assert rel.relation_type == "co_occurs_with"
    assert 0 < rel.confidence <= 1.0
```
🧹 Nitpick | 🔵 Trivial
Test should verify the confidence calculation.
The test checks 0 < rel.confidence <= 1.0 but doesn't verify the expected value min(0.9, 0.8) * 0.7 = 0.56. Adding an exact assertion would catch regressions in the confidence formula.
💚 Proposed enhancement
```diff
     assert 0 < rel.confidence <= 1.0
+    # min(0.9, 0.8) * 0.7 = 0.56
+    assert rel.confidence == pytest.approx(0.56)
```

Add `import pytest` at the top of the file.
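The `pytest.approx` suggestion is not cosmetic: on CPython's IEEE-754 doubles, `min(0.9, 0.8) * 0.7` is not exactly the literal `0.56`, so comparing against the literal fails, while comparing against an identically computed `expected_conf` (as the aggregated prompt suggests) passes. A quick standalone check:

```python
conf = min(0.9, 0.8) * 0.7

# Exact comparison against the literal fails due to binary rounding
exact_match = (conf == 0.56)

# A tolerance-based check (what pytest.approx performs) passes
close_match = abs(conf - 0.56) < 1e-12

# The identically computed expression compares equal bit-for-bit
expected_conf = min(0.9, 0.8) * 0.7
same_computation = (conf == expected_conf)
```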
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_kg_relations.py` around lines 13 - 25, Update the
test_two_entities_produce_relation test to assert the exact confidence value
produced by extract_cooccurrence_relations: compute expected_conf = min(0.9,
0.8) * 0.7 and replace the loose check (0 < rel.confidence <= 1.0) with an exact
assertion rel.confidence == expected_conf; ensure you reference the
ExtractedEntity instances used in the test and the relation object rel returned
by extract_cooccurrence_relations and also add import pytest at the top of the
test file if the test runner or assertion style requires it.
```python
def test_no_duplicate_relations(self):
    """Same entity pair should produce at most one relation."""
    entities = [
        ExtractedEntity(text="Foo", entity_type="project", start=0, end=3, confidence=0.9, source="seed"),
        ExtractedEntity(text="Bar", entity_type="technology", start=10, end=13, confidence=0.8, source="seed"),
    ]
    relations = extract_cooccurrence_relations(entities)
    pairs = [(r.source_text, r.target_text) for r in relations]
    assert len(pairs) == len(set(pairs))
```
🧹 Nitpick | 🔵 Trivial
Test doesn't fully verify deduplication—only checks uniqueness of returned pairs.
This test passes even if duplicates were never possible in the first place. To properly test deduplication, provide duplicate entity pairs (e.g., same text appearing twice in the input list) and verify only one relation is produced.
💚 Proposed fix to test actual deduplication logic
```diff
     def test_no_duplicate_relations(self):
-        """Same entity pair should produce at most one relation."""
+        """Duplicate entity pairs in input should produce only one relation."""
         entities = [
             ExtractedEntity(text="Foo", entity_type="project", start=0, end=3, confidence=0.9, source="seed"),
             ExtractedEntity(text="Bar", entity_type="technology", start=10, end=13, confidence=0.8, source="seed"),
+            # Duplicate mention of same pair
+            ExtractedEntity(text="Foo", entity_type="project", start=20, end=23, confidence=0.85, source="gliner"),
+            ExtractedEntity(text="Bar", entity_type="technology", start=30, end=33, confidence=0.75, source="gliner"),
         ]
         relations = extract_cooccurrence_relations(entities)
-        pairs = [(r.source_text, r.target_text) for r in relations]
-        assert len(pairs) == len(set(pairs))
+        # Should only produce one Foo-Bar relation despite 4 possible pairings
+        foo_bar_relations = [r for r in relations if {r.source_text, r.target_text} == {"Foo", "Bar"}]
+        assert len(foo_bar_relations) == 1
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
def test_no_duplicate_relations(self):
    """Duplicate entity pairs in input should produce only one relation."""
    entities = [
        ExtractedEntity(text="Foo", entity_type="project", start=0, end=3, confidence=0.9, source="seed"),
        ExtractedEntity(text="Bar", entity_type="technology", start=10, end=13, confidence=0.8, source="seed"),
        # Duplicate mention of same pair
        ExtractedEntity(text="Foo", entity_type="project", start=20, end=23, confidence=0.85, source="gliner"),
        ExtractedEntity(text="Bar", entity_type="technology", start=30, end=33, confidence=0.75, source="gliner"),
    ]
    relations = extract_cooccurrence_relations(entities)
    # Should only produce one Foo-Bar relation despite 4 possible pairings
    foo_bar_relations = [r for r in relations if {r.source_text, r.target_text} == {"Foo", "Bar"}]
    assert len(foo_bar_relations) == 1
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_kg_relations.py` around lines 35 - 43, The test
test_no_duplicate_relations doesn't exercise deduplication because it supplies
only unique entities; update it to include duplicate entity instances (e.g., two
ExtractedEntity objects with the same text/entity_type/start/end or same
text/source) so extract_cooccurrence_relations receives repeated entity pairs,
then assert that only one relation is produced for that duplicate pair (check
length and that pairs == set(pairs) or that count of that specific pair == 1).
Locate this logic in the test_no_duplicate_relations function and modify the
entities list (using ExtractedEntity) and the assertions accordingly so the
deduplication behavior of extract_cooccurrence_relations is actually validated.
Summary

Root cause found: brain_digest found entities via seed matching but never created knowledge graph edges. Relation extraction only ran via the LLM path (`extract_entities_llm`), which requires an `llm_caller` parameter. When brain_digest calls `process_chunk()` without an LLM, the relations list is always empty.

Fix: Added `extract_cooccurrence_relations()`, a rule-based relation extractor that infers `co_occurs_with` relations between entities of different types that appear in the same text. This runs as step 4 in `extract_entities_combined()`, always active regardless of LLM availability.

How it works

Each inferred relation's confidence is `min(entity_a.confidence, entity_b.confidence) * 0.7`.

Stub tools status
Investigated brain_update, brain_expand, brain_tags: they already return `isError: true` with deprecation messages. Not fake-success stubs. No fix needed.

MCP response visibility

Investigated brain_search response format: the Python MCP server uses clean `TextContent(type="text", text=...)` with no ANSI codes. Format is correct per MCP spec. The "invisible results" issue is a Claude Code UI behavior (collapses tool results), not a BrainLayer bug.

Test plan
- New tests in `tests/test_kg_relations.py`
- `ruff check` + `ruff format` clean

🤖 Generated with Claude Code
Note

Add co-occurrence relation extraction for knowledge graph edges

- Adds `extract_cooccurrence_relations` in entity_extraction.py that generates `co_occurs_with` relations between entities of different types appearing in the same text.
- Pairs are deduplicated by a sorted `(a.text, b.text)` key; confidence is set to `0.7 * min(a.confidence, b.confidence)`.
- `extract_entities_combined` now calls this function after sorting final entities and extends `all_relations` with the result, regardless of whether LLM extraction is enabled.

Macroscope summarized 6487081.
Summary by CodeRabbit

Release Notes

New Features
- Rule-based co-occurrence relation extraction now creates knowledge graph edges without requiring an LLM.

Tests
- Added tests covering co-occurrence relation extraction.