
feat: co-occurrence relation extraction for KG edges #178

Merged
EtanHey merged 1 commit into main from fix/kg-relations-stub-cleanup
Apr 2, 2026

Conversation

@EtanHey (Owner) commented Apr 2, 2026

Summary

Root cause: brain_digest found entities via seed matching but never created knowledge graph edges. Relation extraction only ran via the LLM path (extract_entities_llm), which requires an llm_caller parameter, so when brain_digest calls process_chunk() without an LLM, the relations list is always empty.

Fix: Added extract_cooccurrence_relations() — a rule-based relation extractor that infers co_occurs_with relations between entities of different types that appear in the same text. This runs as step 4 in extract_entities_combined(), always active regardless of LLM availability.

How it works

  • Pairs entities of different types (e.g., project + technology)
  • Skips same-type pairs (project + project) — too noisy
  • Deduplicates: each pair gets at most one relation
  • Confidence = min(entity_a.confidence, entity_b.confidence) * 0.7
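
The rules above can be sketched in plain Python. This is a standalone illustration — the Entity/Relation dataclasses are stand-ins; the real ExtractedEntity and ExtractedRelation types in entity_extraction.py have more fields:

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Entity:  # stand-in for ExtractedEntity
    text: str
    entity_type: str
    confidence: float

@dataclass
class Relation:  # stand-in for ExtractedRelation
    source_text: str
    target_text: str
    relation_type: str
    confidence: float

def cooccurrence_relations(entities: list[Entity]) -> list[Relation]:
    relations: list[Relation] = []
    seen: set[tuple[str, str]] = set()
    for a, b in combinations(entities, 2):
        if a.entity_type == b.entity_type:
            continue  # same-type pairs (project + project) are too noisy
        pair = (a.text, b.text) if a.text < b.text else (b.text, a.text)
        if pair in seen:
            continue  # each pair gets at most one relation
        seen.add(pair)
        relations.append(Relation(a.text, b.text, "co_occurs_with",
                                  min(a.confidence, b.confidence) * 0.7))
    return relations
```

For a project entity at 0.9 and a technology entity at 0.8, this yields a single co_occurs_with relation at confidence 0.56.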

Stub tools status

Investigated brain_update, brain_expand, brain_tags — they already return isError: true with deprecation messages. Not fake-success stubs. No fix needed.

MCP response visibility

Investigated brain_search response format — Python MCP server uses clean TextContent(type="text", text=...) with no ANSI codes. Format is correct per MCP spec. The "invisible results" issue is a Claude Code UI behavior (collapses tool results), not a BrainLayer bug.
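
For reference, a minimal sketch of the clean response shape described here — the dict form of an MCP tool result with one plain-text content block. Field names follow the MCP spec; the actual server builds this via the SDK's TextContent class:

```python
import json

def text_result(text: str) -> dict:
    """Dict form of an MCP tool result: one text content block, no ANSI codes."""
    return {"content": [{"type": "text", "text": text}], "isError": False}

result = text_result("Found 3 chunks matching 'relation extraction'")
assert "\x1b" not in result["content"][0]["text"]  # no terminal escape codes
print(json.dumps(result, indent=2))
```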

Test plan

  • 5 new tests in tests/test_kg_relations.py
  • ruff check + ruff format clean

🤖 Generated with Claude Code

Note

Add co-occurrence relation extraction for knowledge graph edges

  • Adds extract_cooccurrence_relations in entity_extraction.py that generates co_occurs_with relations between entities of different types appearing in the same text.
  • Pairs are de-duplicated using a sorted (a.text, b.text) key; confidence is set to 0.7 * min(a.confidence, b.confidence).
  • extract_entities_combined now calls this function after sorting final entities and extends all_relations with the result, regardless of whether LLM extraction is enabled.
  • New tests in test_kg_relations.py cover cross-type pairing, single-entity edge cases, de-duplication, and same-type skipping.

Macroscope summarized 6487081.

Summary by CodeRabbit

Release Notes

  • New Features

    • Entity extraction now automatically infers relationships between entities of different types based on co-occurrence patterns. Confidence scores for these relationships are calculated from the minimum confidence of the related entities.
  • Tests

    • Added comprehensive test coverage for co-occurrence relationship extraction functionality.

Root cause: brain_digest found entities but never created edges because
relation extraction only ran via the LLM path (which requires an
llm_caller). No rule-based relation extractor existed.

Fix: Added extract_cooccurrence_relations() — infers "co_occurs_with"
relations between entities of different types in the same text. Runs
without LLM, integrated into extract_entities_combined() as step 4.

Confidence = min(entity_a, entity_b) * 0.7 (conservative).
Same-type pairs skipped (too noisy). Deduplicated.

5 new tests covering: pair creation, single entity, dedup, same-type
exclusion, integration with combined extraction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@EtanHey (Owner, Author) commented Apr 2, 2026

@coderabbitai review

@coderabbitai (Bot) commented Apr 2, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@greptile-apps (Bot) left a comment

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@coderabbitai (Bot) commented Apr 2, 2026

📝 Walkthrough

Walkthrough

The changes add a new extract_cooccurrence_relations() function that automatically generates co_occurs_with relations between entities of differing types. The function computes confidence scores as the minimum of both entities' confidences multiplied by 0.7 and deduplicates relations by tracking seen entity pairs. This is integrated into the existing extract_entities_combined() pipeline to produce relations on every extraction call.

Changes

  • src/brainlayer/pipeline/entity_extraction.py — Added extract_cooccurrence_relations() to generate co_occurs_with relations between entities with different types, with deduplication and confidence calculation. Integrated into extract_entities_combined() to append co-occurrence relations to all extracted relations.
  • tests/test_kg_relations.py — New test module with TestCooccurrenceRelations verifying correct generation, deduplication, and type-filtering of co-occurrence relations, and TestCombinedExtractsRelations confirming extract_entities_combined() produces relations without LLM involvement.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Relations bloom where entities meet,
Different types in co-occurrence sweet,
Confidence flows through deduped pairs,
A web of knowledge, layer by layer! ✨

🚥 Pre-merge checks | ✅ 3 passed

  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title 'feat: co-occurrence relation extraction for KG edges' directly and clearly describes the main change: a new feature for extracting co-occurrence relations to create knowledge graph edges.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


@coderabbitai (Bot) left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/brainlayer/pipeline/entity_extraction.py`:
- Line 517: The current pair key uses raw casing (pair = (a.text, b.text) if
a.text < b.text else (b.text, a.text)), which causes inconsistent deduplication
for the same entity with different case; change the comparison to use a
case-normalized form (use Unicode-aware casefold or lowercase) for ordering
(e.g., compare a_key = a.text.casefold() and b_key = b.text.casefold()), but
keep the original a.text/b.text values for the stored pair if you need to
preserve original casing; update the code that constructs pair in
entity_extraction.py to order by the normalized keys instead of the raw texts.
- Around line 513-520: The nested pairwise loop in entity_extraction.py (the for
i, a in enumerate(entities): for b in entities[i + 1 :] loop that builds
relations and uses seen) can blow up to O(n^2); before generating pairs filter
and/or cap candidates (e.g., filter entities by a confidence threshold field
like entity.confidence, or limit to top-K entities by confidence), then produce
at most max_relations_per_chunk relations (stop generating once count reaches
the cap). Implement this in the code around the loop that builds pair and seen:
first create a filtered_entities list (apply threshold or take top-K by
confidence), then run the pairwise loop on that list and break early when the
relation count reaches max_relations_per_chunk; make max_relations_per_chunk and
confidence threshold configurable constants or function parameters.

In `@tests/test_kg_relations.py`:
- Around line 10-52: Add two edge-case tests to TestCooccurrenceRelations:
implement test_empty_entities_returns_empty which calls
extract_cooccurrence_relations([]) and asserts an empty list (relations == []),
and implement test_three_different_types_produce_three_relations which creates
three ExtractedEntity instances of distinct entity_type values and asserts
len(extract_cooccurrence_relations(entities)) == 3 to verify the N*(N-1)/2
co-occurrence behavior; reference the existing ExtractedEntity and
extract_cooccurrence_relations symbols when adding these tests.
- Around line 35-43: The test test_no_duplicate_relations doesn't exercise
deduplication because it supplies only unique entities; update it to include
duplicate entity instances (e.g., two ExtractedEntity objects with the same
text/entity_type/start/end or same text/source) so
extract_cooccurrence_relations receives repeated entity pairs, then assert that
only one relation is produced for that duplicate pair (check length and that
pairs == set(pairs) or that count of that specific pair == 1). Locate this logic
in the test_no_duplicate_relations function and modify the entities list (using
ExtractedEntity) and the assertions accordingly so the deduplication behavior of
extract_cooccurrence_relations is actually validated.
- Around line 13-25: Update the test_two_entities_produce_relation test to
assert the exact confidence value produced by extract_cooccurrence_relations:
compute expected_conf = min(0.9, 0.8) * 0.7 and replace the loose check (0 <
rel.confidence <= 1.0) with an exact assertion rel.confidence == expected_conf;
ensure you reference the ExtractedEntity instances used in the test and the
relation object rel returned by extract_cooccurrence_relations and also add
import pytest at the top of the test file if the test runner or assertion style
requires it.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d100aa5a-160f-4177-b456-d17123f4de51

📥 Commits

Reviewing files that changed from the base of the PR and between febf02b and 6487081.

📒 Files selected for processing (2)
  • src/brainlayer/pipeline/entity_extraction.py
  • tests/test_kg_relations.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.12)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

Files:

  • tests/test_kg_relations.py
  • src/brainlayer/pipeline/entity_extraction.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use pytest for testing

Files:

  • tests/test_kg_relations.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use Python/Typer CLI architecture for the main package in src/brainlayer/
All scripts and CLI must use paths.py:get_db_path() for resolving database path instead of hardcoding
Implement chunk lifecycle management with columns superseded_by, aggregated_into, archived_at; default search excludes lifecycle-managed chunks
Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
Drop FTS triggers before bulk deletes on chunks table and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches
Implement retry logic on SQLITE_BUSY errors; each worker must use its own database connection
Use ruff check src/ && ruff format src/ for linting and formatting

Files:

  • src/brainlayer/pipeline/entity_extraction.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Request codex review, cursor review, and bugbot review for BrainLayer PRs
🔇 Additional comments (3)
src/brainlayer/pipeline/entity_extraction.py (2)

500-531: co_occurs_with will be normalized to related_to downstream.

Per kg_extraction.py lines 99-102, any relation_type not in CANONICAL_RELATION_TYPES is normalized to "related_to". The set at lines 31-41 does not include "co_occurs_with".

If the semantic distinction matters (e.g., for filtering or weighting co-occurrence edges differently), add "co_occurs_with" to CANONICAL_RELATION_TYPES. If all you need is some edge, the current behavior is fine—just note that co_occurs_with becomes related_to after validation.


#!/bin/bash
# Verify CANONICAL_RELATION_TYPES contents
rg -n "CANONICAL_RELATION_TYPES" src/brainlayer/pipeline/kg_extraction.py -A 12
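
The normalization behavior described above can be sketched as follows. The set contents here are assumed for illustration — the real CANONICAL_RELATION_TYPES lives in kg_extraction.py, and the point is that it does not include "co_occurs_with":

```python
# Assumed canonical set for illustration; the real set is defined in
# kg_extraction.py and does not include "co_occurs_with".
CANONICAL_RELATION_TYPES = {"uses", "depends_on", "part_of", "related_to"}

def normalize_relation_type(relation_type: str) -> str:
    # Non-canonical types collapse to the generic "related_to" edge.
    if relation_type in CANONICAL_RELATION_TYPES:
        return relation_type
    return "related_to"

assert normalize_relation_type("co_occurs_with") == "related_to"
```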

580-582: Integration looks correct—co-occurrence runs on deduplicated entities.

Running after entity deduplication ensures relations reference the canonical entity text. The relations are appended to all_relations, preserving any LLM-extracted relations when use_llm=True.

tests/test_kg_relations.py (1)

58-69: Integration test validates the core PR objective.

This test confirms that extract_entities_combined produces co_occurs_with relations without an LLM, which is the main fix described in the PR objectives.

Comment on lines +513 to +520
for i, a in enumerate(entities):
    for b in entities[i + 1 :]:
        if a.entity_type == b.entity_type:
            continue
        pair = (a.text, b.text) if a.text < b.text else (b.text, a.text)
        if pair in seen:
            continue
        seen.add(pair)
🧹 Nitpick | 🔵 Trivial

Potential quadratic blowup for large entity lists.

For n entities, this produces up to n*(n-1)/2 relations. With 50+ entities in a chunk, you get 1000+ relations—most of which may be noise.

Consider capping output (e.g., top-k by confidence) or limiting to entities above a confidence threshold. This is a minor concern for typical chunk sizes but could degrade performance on outlier chunks.

🛡️ Optional: Add a cap on relation count
+MAX_COOCCURRENCE_RELATIONS = 50  # Cap to avoid noise explosion
+
 def extract_cooccurrence_relations(entities: list[ExtractedEntity]) -> list[ExtractedRelation]:
     ...
     relations: list[ExtractedRelation] = []
     seen: set[tuple[str, str]] = set()
 
     for i, a in enumerate(entities):
         for b in entities[i + 1 :]:
+            if len(relations) >= MAX_COOCCURRENCE_RELATIONS:
+                break
             if a.entity_type == b.entity_type:
                 continue
             ...
+        if len(relations) >= MAX_COOCCURRENCE_RELATIONS:
+            break
 
     return relations

for b in entities[i + 1 :]:
    if a.entity_type == b.entity_type:
        continue
    pair = (a.text, b.text) if a.text < b.text else (b.text, a.text)

🧹 Nitpick | 🔵 Trivial

Lexicographic ordering uses raw text casing—may cause inconsistent deduplication.

a.text < b.text compares by Unicode codepoint, so "Apple" vs "apple" yields different orderings. If the same entity appears with different casing from different sources (e.g., "SQLite" vs "sqlite"), they won't dedupe correctly.

Consider normalizing to lowercase for the pair key:

♻️ Proposed fix
-            pair = (a.text, b.text) if a.text < b.text else (b.text, a.text)
+            key_a, key_b = a.text.lower(), b.text.lower()
+            pair = (key_a, key_b) if key_a < key_b else (key_b, key_a)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-            pair = (a.text, b.text) if a.text < b.text else (b.text, a.text)
+            key_a, key_b = a.text.lower(), b.text.lower()
+            pair = (key_a, key_b) if key_a < key_b else (key_b, key_a)

Comment on lines +10 to +52
class TestCooccurrenceRelations:
    """Rule-based relation extraction from co-occurring entities."""

    def test_two_entities_produce_relation(self):
        """Two entities in the same text should produce a co-occurrence relation."""
        entities = [
            ExtractedEntity(text="BrainLayer", entity_type="project", start=0, end=10, confidence=0.9, source="seed"),
            ExtractedEntity(text="SQLite", entity_type="technology", start=20, end=26, confidence=0.8, source="seed"),
        ]
        relations = extract_cooccurrence_relations(entities)
        assert len(relations) >= 1
        rel = relations[0]
        assert rel.source_text == "BrainLayer"
        assert rel.target_text == "SQLite"
        assert rel.relation_type == "co_occurs_with"
        assert 0 < rel.confidence <= 1.0

    def test_no_relations_for_single_entity(self):
        """A single entity can't have co-occurrence relations."""
        entities = [
            ExtractedEntity(text="BrainLayer", entity_type="project", start=0, end=10, confidence=0.9, source="seed"),
        ]
        relations = extract_cooccurrence_relations(entities)
        assert len(relations) == 0

    def test_no_duplicate_relations(self):
        """Same entity pair should produce at most one relation."""
        entities = [
            ExtractedEntity(text="Foo", entity_type="project", start=0, end=3, confidence=0.9, source="seed"),
            ExtractedEntity(text="Bar", entity_type="technology", start=10, end=13, confidence=0.8, source="seed"),
        ]
        relations = extract_cooccurrence_relations(entities)
        pairs = [(r.source_text, r.target_text) for r in relations]
        assert len(pairs) == len(set(pairs))

    def test_same_type_entities_not_related(self):
        """Entities of the same type shouldn't get co-occurrence relations (too noisy)."""
        entities = [
            ExtractedEntity(text="Foo", entity_type="project", start=0, end=3, confidence=0.9, source="seed"),
            ExtractedEntity(text="Bar", entity_type="project", start=10, end=13, confidence=0.8, source="seed"),
        ]
        relations = extract_cooccurrence_relations(entities)
        assert len(relations) == 0

🧹 Nitpick | 🔵 Trivial

Consider adding edge case tests.

Missing coverage for:

  • Empty entity list (should return empty list)
  • Three entities of different types (should produce 3 relations)
💚 Additional test cases
def test_empty_entities_returns_empty(self):
    """Empty input should return empty relations."""
    relations = extract_cooccurrence_relations([])
    assert relations == []

def test_three_different_types_produce_three_relations(self):
    """N entities of different types produce N*(N-1)/2 relations."""
    entities = [
        ExtractedEntity(text="A", entity_type="project", start=0, end=1, confidence=0.9, source="seed"),
        ExtractedEntity(text="B", entity_type="technology", start=5, end=6, confidence=0.8, source="seed"),
        ExtractedEntity(text="C", entity_type="person", start=10, end=11, confidence=0.7, source="seed"),
    ]
    relations = extract_cooccurrence_relations(entities)
    assert len(relations) == 3  # A-B, A-C, B-C
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_kg_relations.py` around lines 10 - 52, Add two edge-case tests to
TestCooccurrenceRelations: implement test_empty_entities_returns_empty which
calls extract_cooccurrence_relations([]) and asserts an empty list (relations ==
[]), and implement test_three_different_types_produce_three_relations which
creates three ExtractedEntity instances of distinct entity_type values and
asserts len(extract_cooccurrence_relations(entities)) == 3 to verify the
N*(N-1)/2 co-occurrence behavior; reference the existing ExtractedEntity and
extract_cooccurrence_relations symbols when adding these tests.

Comment on lines +13 to +25
def test_two_entities_produce_relation(self):
"""Two entities in the same text should produce a co-occurrence relation."""
entities = [
ExtractedEntity(text="BrainLayer", entity_type="project", start=0, end=10, confidence=0.9, source="seed"),
ExtractedEntity(text="SQLite", entity_type="technology", start=20, end=26, confidence=0.8, source="seed"),
]
relations = extract_cooccurrence_relations(entities)
assert len(relations) >= 1
rel = relations[0]
assert rel.source_text == "BrainLayer"
assert rel.target_text == "SQLite"
assert rel.relation_type == "co_occurs_with"
assert 0 < rel.confidence <= 1.0

🧹 Nitpick | 🔵 Trivial

Test should verify the confidence calculation.

The test checks 0 < rel.confidence <= 1.0 but doesn't verify the expected value min(0.9, 0.8) * 0.7 = 0.56. Adding an exact assertion would catch regressions in the confidence formula.

💚 Proposed enhancement
         assert 0 < rel.confidence <= 1.0
+        # min(0.9, 0.8) * 0.7 = 0.56
+        assert rel.confidence == pytest.approx(0.56)

Add import pytest at the top of the file.


Comment on lines +35 to +43
def test_no_duplicate_relations(self):
"""Same entity pair should produce at most one relation."""
entities = [
ExtractedEntity(text="Foo", entity_type="project", start=0, end=3, confidence=0.9, source="seed"),
ExtractedEntity(text="Bar", entity_type="technology", start=10, end=13, confidence=0.8, source="seed"),
]
relations = extract_cooccurrence_relations(entities)
pairs = [(r.source_text, r.target_text) for r in relations]
assert len(pairs) == len(set(pairs))

🧹 Nitpick | 🔵 Trivial

Test doesn't fully verify deduplication—only checks uniqueness of returned pairs.

This test passes even if duplicates were never possible in the first place. To properly test deduplication, provide duplicate entity pairs (e.g., same text appearing twice in the input list) and verify only one relation is produced.

💚 Proposed fix to test actual deduplication logic
     def test_no_duplicate_relations(self):
-        """Same entity pair should produce at most one relation."""
+        """Duplicate entity pairs in input should produce only one relation."""
         entities = [
             ExtractedEntity(text="Foo", entity_type="project", start=0, end=3, confidence=0.9, source="seed"),
             ExtractedEntity(text="Bar", entity_type="technology", start=10, end=13, confidence=0.8, source="seed"),
+            # Duplicate mention of same pair
+            ExtractedEntity(text="Foo", entity_type="project", start=20, end=23, confidence=0.85, source="gliner"),
+            ExtractedEntity(text="Bar", entity_type="technology", start=30, end=33, confidence=0.75, source="gliner"),
         ]
         relations = extract_cooccurrence_relations(entities)
-        pairs = [(r.source_text, r.target_text) for r in relations]
-        assert len(pairs) == len(set(pairs))
+        # Should only produce one Foo-Bar relation despite 4 possible pairings
+        foo_bar_relations = [r for r in relations if {r.source_text, r.target_text} == {"Foo", "Bar"}]
+        assert len(foo_bar_relations) == 1
📝 Committable suggestion

Suggested change
-    def test_no_duplicate_relations(self):
-        """Same entity pair should produce at most one relation."""
-        entities = [
-            ExtractedEntity(text="Foo", entity_type="project", start=0, end=3, confidence=0.9, source="seed"),
-            ExtractedEntity(text="Bar", entity_type="technology", start=10, end=13, confidence=0.8, source="seed"),
-        ]
-        relations = extract_cooccurrence_relations(entities)
-        pairs = [(r.source_text, r.target_text) for r in relations]
-        assert len(pairs) == len(set(pairs))
+    def test_no_duplicate_relations(self):
+        """Duplicate entity pairs in input should produce only one relation."""
+        entities = [
+            ExtractedEntity(text="Foo", entity_type="project", start=0, end=3, confidence=0.9, source="seed"),
+            ExtractedEntity(text="Bar", entity_type="technology", start=10, end=13, confidence=0.8, source="seed"),
+            # Duplicate mention of same pair
+            ExtractedEntity(text="Foo", entity_type="project", start=20, end=23, confidence=0.85, source="gliner"),
+            ExtractedEntity(text="Bar", entity_type="technology", start=30, end=33, confidence=0.75, source="gliner"),
+        ]
+        relations = extract_cooccurrence_relations(entities)
+        # Should only produce one Foo-Bar relation despite 4 possible pairings
+        foo_bar_relations = [r for r in relations if {r.source_text, r.target_text} == {"Foo", "Bar"}]
+        assert len(foo_bar_relations) == 1

@EtanHey EtanHey merged commit 0fee1e3 into main Apr 2, 2026
6 checks passed
@EtanHey EtanHey deleted the fix/kg-relations-stub-cleanup branch April 2, 2026 00:18
