Skip to content

Surface OC_SUBTREE_GROUP relationships in semantic search (#1645)#1669

Merged
JSv4 merged 25 commits into
mainfrom
claude/vector-search-relationships-d6CAR
May 17, 2026
Merged

Surface OC_SUBTREE_GROUP relationships in semantic search (#1645)#1669
JSv4 merged 25 commits into
mainfrom
claude/vector-search-relationships-d6CAR

Conversation

@JSv4
Copy link
Copy Markdown
Collaborator

@JSv4 JSv4 commented May 15, 2026

Summary

Implements end-to-end vector search over materialised OC_SUBTREE_GROUP relationships, enabling semantic search to return entire document blocks (not just individual annotations) and deep-link the document viewer directly to them.

Key Changes

Backend

  • New CoreRelationshipVectorStore (core_relationship_vector_store.py): Mirrors CoreAnnotationVectorStore but searches the polymorphic Embedding.relationship slot. Enforces the same visibility/IDOR model via Relationship.objects.visible_to_user() and scopes by corpus/document.

  • Relationship embedding pipeline (embeddings_task.py):

    • synthesize_relationship_block_text(): Concatenates source + target annotation text (newline-separated, capped at SUBTREE_GROUP_BLOCK_TEXT_MAX_CHARS) — same string the embedder sees, so GraphQL clients can render snippets without re-fetching.
    • calculate_embeddings_for_relationship_batch(): Dual-embedding task (default + corpus-preferred embedder) for subtree groups, dispatched automatically by the materialiser.
  • Block-context augmentation (core_vector_stores.py):

    • New BlockContext dataclass: Surfaces the smallest enclosing OC_SUBTREE_GROUP for annotation hits.
    • _attach_block_context_sync(): Joins annotation hits against materialised subtree groups, picks the smallest enclosing block per hit, and attaches it to VectorSearchResult.block_context.
  • Relationship embedding support (annotations/models.py):

    • Relationship now inherits HasEmbeddingMixin so subtree groups can be embedded.
    • get_embedding_reference_kwargs() wires the polymorphic Embedding.relationship FK.
  • Database schema (migrations/0073_embedding_relationship.py):

    • New nullable Embedding.relationship FK + partial unique constraint for polymorphic embedding storage.
  • Subtree materialiser integration (utils/subtree_groups.py):

    • _dispatch_relationship_embeddings(): Enqueues embedding tasks for fresh subtree groups outside the atomic block (idempotent upsert prevents double-embedding on retries).
  • GraphQL API (config/graphql/):

    • New BlockContextType GraphQL type with relationship ID, source/target annotation IDs, and bounded block text.
    • SemanticSearchResultType.block_context field: Populated post-hoc for annotation hits inside a subtree.
    • New semantic_search_relationships query: Returns SemanticSearchRelationshipResult hits ranked by cosine similarity, scoped by corpus/document.

Frontend

  • Deep-linking hook (useJumpToRelationship.ts): Wires the URL ?rel=<pk> parameter to select the relationship and scroll the source annotation into view.

  • Relationship selection UI (CorpusAnnotationCards.tsx): Maps annotation IDs to containing relationship IDs so clicks can deep-link the doc viewer to the block (not just the leaf).

  • Route integration (CentralRouteManager.tsx, navigationUtils.ts): Adds rel query parameter support for relationship deep-links.

  • GraphQL types (frontend/src/graphql/queries.ts): New BlockContextPayload and SemanticSearchRelationshipResult interfaces.

Notable Implementation Details

  • Visibility enforcement: Both stores use visible_to_user() + Q-OR filters to handle structural relationships (anchored via StructuralAnnotationSet) and non-structural rows uniformly — same pattern as CoreAnnotationVectorStore.

  • Idempotent embedding: The materialiser's delete-then-insert pattern keeps relationship PKs stable; dual-embedding tasks use add_embedding()'s upsert logic to short-circuit unchanged inputs.

  • Block-text consistency: synthesize_relationship_block_text() is shared between the embedder (input)

…loses #1645)

Two-front delivery on the materialised OC_SUBTREE_GROUP rows from #1646:

(a) Annotation hits now carry a block_context for the smallest enclosing
    subtree-group. CoreAnnotationVectorStore._attach_block_context_sync
    runs a single join after reranking, picks the smallest containing
    OC_SUBTREE_GROUP per hit, and returns a bounded
    SUBTREE_GROUP_BLOCK_TEXT_MAX_CHARS-capped block_text. Wired into
    search/async_search/hybrid_search/async_hybrid_search/global_search
    and surfaced through PydanticAIVectorSearchResponse and the
    semanticSearch GraphQL resolver.

(b) Relationships are first-class vector targets. Polymorphic
    Embedding.relationship FK (migration 0073) + partial-unique
    constraint, Relationship now mixes in HasEmbeddingMixin, new
    calculate_embeddings_for_relationship_batch task dispatched by
    build_subtree_groups_for_document (dual-embedding strategy: default
    + corpus-preferred). New CoreRelationshipVectorStore runs cosine
    distance against the Embedding table, scoped by
    Relationship.objects.visible_to_user. New semanticSearchRelationships
    GraphQL query returns relationship pk, source/target annotation pks,
    label, block_text, and document/corpus context.

Doc-viewer jump-to: ?rel=<relationship_pk> URL param parsed by
CentralRouteManager into a new selectedRelationshipId reactive var.
useJumpToRelationship hook (mounted in DocumentKnowledgeBase) watches
the param + allRelationsAtom and, once relations are loaded, selects
the relation and scrolls its source annotation into view.
CorpusAnnotationCards click handler now forwards block_context.
relationship_id as a relationshipId query param so users landing from
a leaf-annotation hit also see the containing block selected.

Permissioning mirrors resolve_semantic_search: empty-list response for
missing-or-denied document/corpus prevents IDOR enumeration; corpus-
scoped queries derive the embedder from Corpus.preferred_embedder so
relationship vectors stay in the corpus's frozen embedding space.

Tests cover smallest-enclosing-group selection, root-level pass-through,
block_text truncation, end-to-end relationship retrieval, IDOR denial,
and the rel= URL-builder round-trip.
Comment thread opencontractserver/tasks/embeddings_task.py Fixed
@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Code Review — PR #1669: Surface OC_SUBTREE_GROUP relationships in semantic search

Overall this is a well-structured, well-documented PR with solid security hygiene. The IDOR model mirrors the annotation store exactly, the migration is correct, and the test coverage hits the meaningful paths. The comments below are mostly about one DRY violation, one potential front-end bug, and a handful of minor issues.


🐛 Potential Bug — useJumpToRelationship.ts: raw PK vs. Relay ID mismatch

// useJumpToRelationship.ts
const match: RelationGroup | undefined = allRelations.find(
  (r) => r.id === relId   // relId is a raw Django PK string, e.g. "42"
);

relId comes from selectedRelationshipId → URL ?rel=<pk>, which carries a raw Django PK (as documented everywhere in this PR). However, allRelationsAtom is populated from GraphQL data where Relationship.id is a Relay global ID (e.g. "UmVsYXRpb25zaGlwOjQy"). If RelationGroup.id is a Relay global ID, this find will never match, making the entire hook a silent no-op.

Please verify that RelationGroup.id (in frontend/src/components/annotator/types/annotations.ts) uses raw PKs, not Relay IDs. If it uses Relay IDs, the hook needs to decode with fromGlobalId(r.id)[1] === relId before comparing.


🧹 DRY Violation — block text construction implemented three times

The CHANGELOG says synthesize_relationship_block_text() is "shared between the embedder and the GraphQL surface," but the code tells a different story:

  1. embeddings_task.pysynthesize_relationship_block_text() (the canonical function)
  2. core_relationship_vector_store.py:_shape_results() — inline bounded-concat, ~25 lines
  3. core_vector_stores.py:_attach_block_context_sync() — inline bounded-concat, ~20 lines

Neither vector store actually calls synthesize_relationship_block_text. This is a clear CLAUDE.md DRY violation and creates a divergence risk — if the cap logic ever needs to change, all three must be updated.

Fix: import synthesize_relationship_block_text (or a lower-level _build_block_text(parts, max_chars) helper) in both vector stores and call it instead of repeating the loop.


⚡ Performance — N+1 in calculate_embeddings_for_relationship_batch

# embeddings_task.py
relationships = list(
    Relationship.objects.filter(pk__in=relationship_ids)
    .prefetch_related("source_annotations", "target_annotations")   # ← no order_by
)

# Later, inside _apply_dual_embedding_strategy → synthesize_relationship_block_text:
relationship.source_annotations.order_by("id").values_list(...)    # ← cache miss
relationship.target_annotations.order_by("id").values_list(...)    # ← cache miss

prefetch_related without order_by("id") doesn't share its cache with the ordered queryset inside synthesize_relationship_block_text, so each call re-hits the DB for both M2M sides — 2 extra queries per relationship. For a document that materialises 200 subtree groups this is 400 extra queries in one Celery task.

Fix: Either pre-sort the prefetch (Prefetch("source_annotations", queryset=Annotation.objects.order_by("id"))) or pass the already-fetched lists into synthesize_relationship_block_text to avoid re-fetching.


⚡ Performance — extra visible_qs.exists() check in search()

# core_relationship_vector_store.py
def search(self, query):
    visible_qs = self._build_visible_relationship_qs(query.label_texts)
    if not visible_qs.exists():   # ← separate COUNT query
        return []
    ...

_build_visible_relationship_qs already returns Relationship.objects.none() for all denial cases. The .exists() call fires an extra COUNT before the actual cosine-distance query. Just remove the check — _run_vector_search will return [] for an empty queryset anyway.


🧪 Test isolation issue — test_block_text_is_bounded mutates shared data

# test_subtree_group_vector_search.py — BlockContextAttachTestCase
def test_block_text_is_bounded(self) -> None:
    big = "x" * (SUBTREE_GROUP_BLOCK_TEXT_MAX_CHARS + 100)
    self.leaf.raw_text = big
    self.leaf.save(update_fields=["raw_text"])   # mutates shared setUpTestData state

setUpTestData wraps data in a class-level savepoint, but mutations inside a test method (save()) are only rolled back after the individual test (within the TestCase transaction). Other tests in the same class that run after this one will see the inflated leaf.raw_text if Django's test runner changes ordering. Use setUp() with per-test teardown, or save and restore the original value.


🔍 Minor Issues

SEMANTIC_SEARCH_RELATIONSHIPS query has no UI consumer
The query constant and TypeScript types are added to queries.ts but no component calls them yet. If this is intentional (backend-first, UI follow-up), a comment or TODO would make intent clear.

getattr(r, "similarity_score", 1.0) in _shape_results
_run_vector_search explicitly sets r.similarity_score in the loop before returning, so the getattr fallback is never reached. Just use r.similarity_score.

BlockContextType and SemanticSearchRelationshipResultType use raw Django PKs
The inline docstrings explain the non-Relay-ID choice clearly; that's fine. Worth adding a note to the API docs or schema description that clients must NOT pass these to from_global_id — the existing inline docs do this, but only in Python-side comments. The GraphQL description strings could also warn clients directly.

Constructor over-validation in CoreRelationshipVectorStore.__init__
The RuntimeError on the inner if corpus_id is None branch can never be reached because the outer check already raises ValueError when both are None. The inner guard is dead code.

hoveredAnnotationId not cleared on relationship deselection
useJumpToRelationship clears selectedRelations when relId becomes null, but doesn't clear hoveredAnnotationId. A stale hover highlight will remain until the user moves their cursor.


✅ What's well done

  • IDOR model is airtight — both the resolver and the store return [] for denied/absent resources, identical to the annotation path.
  • Idempotent embedding dispatch — calling _dispatch_relationship_embeddings outside the atomic block, with the upsert in store_embedding, is the right pattern.
  • Migration is correct — nullable FK + partial unique constraint on (embedder_path, relationship) matches the model definition exactly.
  • Block context attach timing — done after reranking across all entry-points (search, async_search, hybrid_search, async_hybrid_search, global_search) is the right call.
  • lastAppliedRef in useJumpToRelationship elegantly prevents re-selecting the URL-driven relation on every allRelations mutation.
  • Test coverage hits the meaningful paths: smallest-enclosing-group, IDOR denial, block_text cap, and the round-trip navigation URL contract.

JSv4 added 3 commits May 15, 2026 18:53
- Fix relationship deep-link bug: useJumpToRelationship now compares
  numeric PKs (URL ?rel=<pk> carries raw PK, but RelationGroup.id is
  a Relay global ID), so the find() actually matches.
- Clear hoveredAnnotationId on relationship deselection so the stale
  hover indicator doesn't linger after ?rel= is cleared.
- DRY block-text construction: extract join_block_text_parts() and
  share it between synthesize_relationship_block_text (embedder),
  CoreRelationshipVectorStore._shape_results, and
  CoreAnnotationVectorStore._attach_block_context_sync. Cap/truncation
  logic now lives in one place.
- Fix N+1 in calculate_embeddings_for_relationship_batch: order the
  prefetch by id and read raw_text from prefetched objects when the
  cache is populated.
- Fix smallest-enclosing-block bug: Count("target_annotations") with
  target_annotations__in=hit_ids in the same queryset is restricted by
  the filter join, collapsing every group's descendant_count to 1 and
  forcing a tie-break on lowest pk (i.e. always selecting ROOT instead
  of the innermost enclosing block). Compute descendant_count from the
  prefetched target set instead. Existing test now correctly passes.
- Remove dead visible_qs.exists() check, dead RuntimeError branch in
  CoreRelationshipVectorStore.__init__, and dead getattr fallback for
  similarity_score.
- Fix test isolation: mutating tests now refetch rows so shared
  cls-attributes aren't dirtied for sibling tests.
…tionship hook

Backend (opencontractserver/tests/test_semantic_search_graphql.py):
- New SemanticSearchRelationshipsQueryTest class covering the
  semantic_search_relationships GraphQL resolver: basic dispatch,
  corpus/document scoping via Relay global IDs, IDOR contract
  (invisible corpus/document → empty list), limit clamping, and
  anonymous rejection.
- New SemanticSearchBlockContextFieldTest covering the BlockContext
  field mapping on semanticSearch — exercises the GraphQL type wiring
  introduced alongside OC_SUBTREE_GROUP relationship surfacing.

Frontend (useJumpToRelationship.test.tsx):
- Unit tests pinning the URL-PK ↔ Relay-global-ID bridge that was
  fixed earlier in this PR: a regression test ensures
  ?rel=<raw-pk> resolves against RelationGroup.id (Relay ID) via
  numeric comparison, not string equality.
- Covers no-op paths, non-numeric fallback, hover-clear on
  deselection, and the lastAppliedRef guard.
Comment thread opencontractserver/tasks/embeddings_task.py Fixed
@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Code Review: #1645 — Surface OC_SUBTREE_GROUP in semantic search

This is a well-scoped, well-documented PR. The architecture is consistent, security properties are maintained, and test coverage is solid. A few items below worth addressing before merge.


Overview

The PR builds end-to-end relationship-targeted vector search on top of the materialised OC_SUBTREE_GROUP rows from #1646. The approach is sound: add a polymorphic Embedding.relationship FK, embed subtree groups at ingestion time, surface them through a new CoreRelationshipVectorStore and semanticSearchRelationships GraphQL query, and wire a ?rel=<pk> deep-link into the doc viewer via useJumpToRelationship.


Issues

1. join_block_text_parts belongs in a utility file, not embeddings_task.py

join_block_text_parts is imported by core_vector_stores.py and core_relationship_vector_store.py — neither is a task. Placing a shared pure helper inside a Celery task module inverts the dependency direction (stores importing from tasks) and violates the Single Responsibility principle in CLAUDE.md. It should live in opencontractserver/utils/ or alongside the constants it's tied to.

2. Accessing Django private _prefetched_objects_cache

In synthesize_relationship_block_text (embeddings_task.py):

prefetched = getattr(relationship, "_prefetched_objects_cache", None) or {}
if "source_annotations" in prefetched:
    sources = [a.raw_text or "" for a in src_qs.all()]

_prefetched_objects_cache is a Django internal that has changed shape across versions. The correct public API is to check relationship.source_annotations._result_cache is not None, or simpler: always use order_by("id").values_list() — the DB hit is negligible for a Celery task and removes the fragile introspection entirely.

3. _attach_block_context_sync docstring describes a Count annotation that is not in the code

The docstring says "the M2M join is annotated with descendant_count = Count(target_annotations)". The actual implementation correctly avoids this (explained in inline comments) and uses len(target_ids_in_group) from the prefetch cache instead. The docstring is misleading — update it to match the implementation.

4. Relay global ID / raw PK inconsistency in the GraphQL schema

BlockContextType.relationship_id and all ID fields on SemanticSearchRelationshipResultType (relationship_id, document_id, corpus_id, source_annotation_id, target_annotation_ids) are declared as graphene.ID but carry raw Django PKs, not Relay global IDs. Every other entity across the schema uses Relay IDs. Field descriptions document this, but declaring them as graphene.Int would make the type system self-documenting and prevent client code from accidentally passing them through from_global_id.

5. No retry/scroll when virtualised page has not rendered yet

In useJumpToRelationship.ts:

const ref = targetId ? refs[targetId] : undefined;
if (ref && typeof ref.scrollIntoView === "function") {
  ref.scrollIntoView({ behavior: "smooth", block: "center" });

When the source annotation lives on a page not yet materialised by the virtualised renderer (common for large PDFs on initial load), ref is undefined and the scroll silently does nothing. lastAppliedRef.current is then set to relId so the effect never retries. The deep-link succeeds in selecting the relation but the scroll does nothing. One fix: do not set lastAppliedRef.current until the scroll actually fires.

6. cast masking a type mismatch in calculate_embeddings_for_relationship_batch

embed_func=cast(
    "Callable[[HasEmbeddingMixin, BaseEmbedder, str], bool]",
    _embed_relationship,
),

_embed_relationship takes Relationship, not HasEmbeddingMixin. The cast silences the type checker rather than fixing the mismatch. Either type _apply_dual_embedding_strategy's embed_func parameter to accept a Protocol that Relationship satisfies, or use Any. Low-risk at runtime but misleads future readers.


Minor / Nits

  • _shape_results corpus/document fallback: Structural relationships always have document=NULL and corpus=NULL by design. The fallback to self.corpus_id is reasonable for breadcrumbs but the comment should note it is a hint, not ground truth — a structural set can be shared across multiple corpora.

  • Block-context augmentation on every search path: The extra DB hit is small and bounded, but consider exposing an include_block_context: bool = True flag on VectorSearchQuery so agent tool callers that do not need the context can skip the round-trip.

  • Missing global_search/async_global_search on CoreRelationshipVectorStore: Fine to omit for now, but a TODO comment would make the gap visible.

  • Test gap: No test for the case where multiple hits share the same smallest enclosing group (multiple entries under the same group in best_for_hit). The logic appears correct but a regression test would be useful.


Security

IDOR contract is correctly maintained: both the store (_build_visible_relationship_qs) and the resolver (resolve_semantic_search_relationships) return [] for both missing and denied resources. visible_to_user is applied before any vector scoring so denied objects never enter the result set.


Test Coverage

Coverage is strong across all new components:

  • BlockContextAttachTestCase: smallest-enclosing-group selection, root-level no-context, empty passthrough, block_text truncation
  • RelationshipVectorStoreTestCase: end-to-end embed + retrieve + IDOR denial
  • SynthesizeBlockTextTestCase: text synthesis correctness and truncation
  • SemanticSearchRelationshipsQueryTest: GraphQL dispatch, IDOR, limit cap, anonymous rejection
  • SemanticSearchBlockContextFieldTest: schema wiring guard
  • useJumpToRelationship.test.tsx: hook lifecycle, Relay ID decoding regression, clear-on-null

Summary: Items 1 (utility file placement), 3 (misleading docstring), and 5 (scroll no-op on unrendered pages) should be fixed before merge. Items 2 and 4 are lower urgency but worth a follow-up. The rest are optional improvements.

JSv4 added 2 commits May 16, 2026 08:35
- Move join_block_text_parts and synthesize_relationship_block_text from
  tasks/embeddings_task.py to utils/embeddings.py so the GraphQL-facing
  vector stores no longer depend on a Celery task module (SRP / utility
  placement).
- Drop the _prefetched_objects_cache introspection in
  synthesize_relationship_block_text and always go through
  ``order_by("id").values_list``; the cache shape is a Django internal
  that has shifted across versions, and the two extra value queries are
  negligible for a Celery task path.
- Fix the misleading docstring in _attach_block_context_sync — there is
  no ``Count(target_annotations)`` annotation in the queryset; the
  descendant count is computed in Python from the prefetched ID set
  (as the inline comment explains).
- Make _apply_dual_embedding_strategy generic in the embeddable type
  (TypeVar bound to HasEmbeddingMixin) and drop the cast that was
  forcing _embed_relationship's Relationship-typed signature through
  the HasEmbeddingMixin-shaped parameter.
- In useJumpToRelationship, track scroll application separately from
  selection so the deep-link retries scrolling once the virtualised
  renderer mounts the source annotation's page. Selection is still
  applied exactly once per relId change, so user-driven edits are not
  fought.
- Update test_embedding_manager assertion to match the new error
  message that now lists relationship_id among the valid parent IDs.
- Replace the ``as any`` in useJumpToRelationship.test.tsx with a
  properly-typed AnnotationLabelType literal to keep the any-baseline
  gate green.
Comment thread opencontractserver/utils/embeddings.py Fixed
@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Code Review — PR #1669: Surface OC_SUBTREE_GROUP relationships in semantic search

This is a well-engineered, substantial feature addition. The architecture is sound and the PR description is exemplary. Below is a breakdown of findings across the stack.


Overview

The PR extends the vector search pipeline end-to-end: materialised OC_SUBTREE_GROUP relationships gain an embedding (via a new polymorphic Embedding.relationship FK + Celery task), a new CoreRelationshipVectorStore enables semantic search over them, annotation hits are augmented with their smallest-enclosing block context post-reranking, the GraphQL API surfaces both, and the frontend deep-links the document viewer to the selected block via a new ?rel=<pk> URL parameter and useJumpToRelationship hook.


Strengths

  • IDOR model is consistent and correct. Every entry point (CoreAnnotationVectorStore, CoreRelationshipVectorStore, and both GraphQL resolvers) returns an empty list — never an error — for missing-or-denied resources. The CLAUDE.md pattern is followed precisely.
  • join_block_text_parts is correctly DRY. Centralising the cap/truncation logic in one utility prevents the text assembled for the embedder from drifting from the text surfaced to clients.
  • Block-context attach is bounded after reranking. Calling _attach_block_context_sync after _apply_rerank means the extra DB hop is bounded by the final top_k rather than the oversampled first-stage pool — a good performance decision.
  • Migration is clean. The partial unique constraint (embedder_path, relationship) mirrors the existing per-annotation/per-note constraints and correctly uses condition=Q(relationship__isnull=False) so the constraint doesn't collide with null rows.
  • Test coverage is solid. Smallest-enclosing-group selection, IDOR denial, truncation at the char cap, the Relay-ID-vs-raw-PK regression test in useJumpToRelationship, and the end-to-end relationship vector retrieval path are all covered.
  • _dispatch_relationship_embeddings is called outside the atomic block. The comment correctly explains why — a transient broker failure should not roll back the materialised rows.

Issues

1. synthesize_relationship_block_text is called twice per relationship in the dual-embedding path (minor redundancy)

In calculate_embeddings_for_relationship_batch, the dual-embedding branch passes:

_apply_dual_embedding_strategy(
    obj=rel,
    text=synthesize_relationship_block_text(rel),   # ← first call
    ...
    embed_func=_embed_relationship,
)

But _embed_relationship internally calls synthesize_relationship_block_text(relationship) again (second call). Since the relationships are prefetched, there's no extra DB hit, but it's wasted computation on large batches. The simplest fix is to call synthesize_relationship_block_text once before the loop, cache the result, and pass it to _embed_relationship as a closure or via a different code path. Alternatively, refactor _embed_relationship to accept a pre-computed text string alongside the relationship.

2. Raw Django PKs in the GraphQL relationship_id fields break the Relay global ID convention

BlockContextType.relationship_id and SemanticSearchRelationshipResultType.relationship_id intentionally use raw Django PKs, documented in the field descriptions and in the PR. The project's GraphQL API otherwise uses Relay global IDs everywhere, and to_global_id("RelationshipType", pk) is a trivially cheap operation on the backend. Mixing conventions means:

  • Frontend code must know which fields are raw PKs and which are global IDs — useJumpToRelationship works around this with getNumericIdFromGlobalId on the RelationGroup.id side, which is the more complex decode.
  • Any future code that naively passes a relationship_id from this API to a resolver that calls from_global_id will silently fail or return a wrong row.

Consider encoding these as Relay global IDs (using to_global_id("RelationshipType", pk)) and decoding in the resolver/hook, matching every other GraphQL type in the codebase. The ?rel= URL parameter can still carry the raw PK — just strip the Relay encoding at the resolver boundary.

3. useJumpToRelationship scroll retry may silently stall for virtualized pages

The hook splits selection (lastAppliedSelectionRef) from scroll (lastAppliedScrollRef) specifically because annotation element refs may not be populated when the hook first runs. However, the effect's dependency array is:

[relId, allRelations, setSelectedRelations, setHoveredAnnotationId, annotationElementRefs]

annotationElementRefs is a useRef — its identity is stable regardless of when new pages mount and add entries to the refs map. So when a target annotation's page is virtualized away and later remounted, the effect does not re-run, and lastAppliedScrollRef stays unset indefinitely. The comment acknowledges this gracefully ("selection without scroll is still useful"), but users who land on a deep-link to a page that hasn't rendered yet will silently get no scroll. Consider triggering a re-run by reading the count of registered refs (if that atom exists) or by setting up a retry timer (e.g., 300 ms after selection is applied, try once more).

4. _run_vector_search sets a dynamic attribute on a Django model instance (minor)

r.similarity_score = max(0.0, min(1.0, 1.0 - distance))

then later:

similarity_score=r.similarity_score,  # type: ignore[attr-defined]

The type: ignore is a signal that the design is leaking. A small @dataclass intermediate (or using the RelationshipVectorSearchResult immediately without the mutation step) would be cleaner and avoid the suppressed error.

5. Missing tests for CoreRelationshipVectorStore.async_search

The async path wraps search via sync_to_async, but this composition is not directly exercised. If anyone adds async-specific logic in the future, the gap will be invisible. A single test that awaits async_search with the same hand-embedded vector (analogous to test_vector_hit_returns_block_metadata) would close this gap.


Nits / minor observations

  • BlockContextType and SemanticSearchRelationshipResultType live in social_types.py. Since SemanticSearchResultType is already there this is consistent, but a search_types.py would be a clearer home if the types file ever gets refactored.
  • The PR has very dense inline comments throughout — some (e.g., the join_block_text_parts loop explaining the +1 sentinel) are genuinely non-obvious; others describe what the code does rather than why. Per CLAUDE.md convention, consider trimming the purely descriptive ones in a follow-up.
  • calculate_embeddings_for_relationship_batch uses autoretry_for=(Exception,) with no exclusions. This retries on ValueError (malformed IDs, missing embedder) — conditions that will never self-heal. Mirroring the annotation task's retry exclusions (if any) or at minimum excluding ValueError / ImportError would avoid wasted retry cycles.

Summary

The core architecture is correct and well-secured. The main items worth addressing before merge are #2 (Relay ID convention consistency — a correctness/maintainability concern) and #3 (the scroll-retry stall for virtualized pages — a user-visible silent failure on deep-links into large documents). The redundant synthesis call (#1) and the dynamic attribute pattern (#4) are minor but worth fixing while the code is fresh. Everything else is polish.

JSv4 added 2 commits May 16, 2026 10:03
… test

- _embed_relationship: accept optional precomputed_text so the dual-
  embedding strategy only synthesizes the relationship block text
  once per relationship instead of once per embedder pass.
- core_relationship_vector_store._run_vector_search: return
  (Relationship, similarity) tuples instead of mutating a dynamic
  similarity_score attribute on the Django model. Drops the
  type: ignore[attr-defined] suppression at the consumer site.
- useJumpToRelationship: add a defensive 300 ms retry timer for the
  scroll-into-view path so deep-links into virtualized PDF pages
  reliably catch the late-registered annotation ref instead of
  stalling silently if the atom re-render doesn't land.
- Add async_search coverage: delegation-to-sync contract and the
  text-only query rebuild path that flows the regenerated embedding
  back through search.

Note: review item #2 (raw Django PKs vs Relay global IDs on the new
relationship_id GraphQL fields) is intentionally deferred — the
choice is documented on the field descriptors and changing the
convention would cascade across the resolver, hook, and URL routing.
@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Code Review — #1669: Surface OC_SUBTREE_GROUP relationships in semantic search

This is a well-designed PR that adds a solid end-to-end feature. The security model is consistent, the async wrapping is handled correctly, and the core algorithm (smallest-enclosing-group selection via a single batched join) avoids N+1 queries. A few items to address:


DRY Violation — CoreRelationshipVectorStore (CLAUDE.md requirement)

CoreRelationshipVectorStore explicitly documents that it mirrors CoreAnnotationVectorStore, and the inline comment in core_relationship_vector_store.py even explains why the code wasn't consolidated ("RelationshipManager is not built from a queryset class today and we don't want to refactor that for this issue"). Per CLAUDE.md's DRY principle, the following patterns are duplicated across both stores and should be extracted:

  • User lookup (User.objects.get(pk=self.user_id) + the not-found early return)
  • IDOR check pattern (document visibility → return Relationship.objects.none(), corpus visibility → same)
  • async_search: the sync_to_async(self.search)(query) wrapper with the embedding-generation preamble is nearly identical in both classes
  • _agenerate_query_embedding (or an equivalent method) is duplicated

A base class or shared mixin (e.g., BaseVectorStore) with _resolve_user, _check_document_idor, _check_corpus_idor, and async_search would eliminate ~100 lines. This is worth doing before merge rather than accumulating more divergence.


Comment Policy Violations (CLAUDE.md)

CLAUDE.md is explicit: "Never write multi-paragraph docstrings or multi-line comment blocks — one short line max."

_attach_block_context_sync has a 22-line docstring explaining implementation details that belong in a PR description, not the source file. Similarly, several inline blocks in core_relationship_vector_store.py exceed one line. The why for non-obvious choices (prefetch vs. annotated COUNT, tie-break on PK) is valuable — but a single terse line is enough:

# prefetch target IDs in Python to preserve tie-break on pk (COUNT collapses it)

store_embedding — No Exactly-One Guard

# shared/Managers.py
if not any([document_id, annotation_id, note_id, conversation_id, message_id, relationship_id]):
    raise ValueError(...)

The guard checks at least one but not exactly one. A caller could pass both annotation_id=1 and relationship_id=2 and the function would accept it, potentially writing an Embedding row with two FK columns set. Embedding.clean() should reject this but it's not called from store_embedding directly. This is an existing issue, but the PR extends the list — a good time to add:

provided = [x for x in [document_id, annotation_id, note_id, conversation_id, message_id, relationship_id] if x]
if len(provided) != 1:
    raise ValueError("Must provide exactly one of document_id, annotation_id, ...")

Missing Test: Both corpus_id and document_id Provided

resolve_semantic_search_relationships accepts both corpus_id and document_id. When both are present, both IDOR checks pass independently and the store receives both — filtering relationships to those belonging to both. This is arguably correct behaviour (document in that corpus), but it's undocumented and untested. A test should cover:

  1. Both provided, document is in the corpus → returns results
  2. Both provided, document is not in the corpus → returns []

The existing test_semantic_search_graphql.py::SemanticSearchRelationshipsQueryTest doesn't cover this case.


void useAtomValue; void PdfAnnotations; in Test File

In useJumpToRelationship.test.tsx:

// Silence unused-import warnings — these are used as type assertions only.
void useAtomValue;
void PdfAnnotations;

This is a code smell. void expr is not a recognized pattern for suppressing unused-import warnings in TypeScript/ESLint — it evaluates the expression for side effects and discards the result, which is unusual for imports. The right approaches:

  • If the imports are truly unused: remove them
  • If they're needed for type checking: use // eslint-disable-next-line @typescript-eslint/no-unused-vars with a comment explaining why
  • If they're needed to register side effects: document that explicitly

synthesize_relationship_block_text — Target Ordering by ID, Not Document Position

# embeddings.py (inferred from description)
for ann in sorted(targets, key=lambda a: a.id):
    ...

Targets are ordered by database ID, which roughly corresponds to insertion order but is not guaranteed to match document position (especially after re-parses where IDs are preserved but ordering may shift). For embedding quality, ordering by document position (e.g., page + bounding box top) would produce more coherent block text for the embedding model. At minimum, the ordering assumption should be documented so future maintainers understand the invariant being relied upon.


Minor: document_id/corpus_id as Raw PKs in SemanticSearchRelationshipResultType

relationship_id = graphene.ID(
    description="... raw Django PK ... NOT a global Relay ID ..."
)

The rest of the GraphQL schema uses Relay global IDs. Returning raw PKs here is intentional (documented in the field description) and the resolver description explains the client-side to_global_id roundtrip, but it creates an inconsistency that will require callers to know which fields are raw PKs and which are Relay IDs. Consider whether this warrants a follow-up to normalize, or at least a schema-level note that documents the mix.


What's Done Well

  • IDOR protection: both the resolver and the store independently return empty results for missing-or-denied resources — no information leakage via error message differences
  • Async wrapping: sync_to_async(self.search)(query) correctly wraps the entire synchronous ORM pipeline; no bare sync DB calls in async paths
  • _attach_block_context_sync query efficiency: single batched join + one IN query for annotation fetch — no N+1
  • Idempotent embedding dispatch: tasks are enqueued outside the atomic block; upsert logic in add_embedding prevents double-embedding on Celery retries
  • Relay ID decoding in useJumpToRelationship: the hook correctly handles the raw PK in the URL vs. Relay global IDs in Apollo cache, with a regression test pinning the non-obvious conversion
  • Constants file compliance: SUBTREE_GROUP_BLOCK_TEXT_MAX_CHARS, OC_SUBTREE_GROUP_LABEL_NAME, etc. all land in the constants module per CLAUDE.md
  • Test coverage for the core algorithm: 4-level tree, root-level hit → None, truncation, IDOR denial

Summary: The feature logic is sound and the security model is correct. The blocking items before merge are (1) the DRY violation in the two vector store classes and (2) the comment-length violations. The store_embedding guard and the missing combined-corpus+document test are smaller but worth addressing in this PR.

Extract shared vector-store machinery into BaseVectorStore (user
resolution, IDOR checks, embedder resolution, query-embedding
generation) so CoreAnnotationVectorStore and CoreRelationshipVectorStore
share the permission-filtering contract instead of duplicating it.

Also:
- Tighten EmbeddingManager.store_embedding to require exactly-one
  parent FK (previously any/all were silently accepted).
- Trim multi-paragraph docstrings on _attach_block_context_sync,
  BlockContext, VectorSearchResult, join_block_text_parts and
  synthesize_relationship_block_text per CLAUDE.md.
- Document the ordering invariant (by id, not document position) in
  synthesize_relationship_block_text.
- Remove the void useAtomValue / void PdfAnnotations smell in
  useJumpToRelationship.test.tsx — the imports were genuinely unused.
- Add tests: store_embedding rejects multiple parent FKs; both
  corpus_id + document_id provided where the document isn't in the
  corpus returns [].

Update test patches to target generate_embeddings_from_text /
get_embedder at their new home in base_vector_store.
if running == 0:
if len(chunk) >= max_chars:
parts.append(chunk[:max_chars])
running = max_chars
@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Code Review — PR #1669: Surface OC_SUBTREE_GROUP relationships in semantic search

Overview

This is a well-architected feature addition that wires the materialised OC_SUBTREE_GROUP rows into vector search end-to-end: new CoreRelationshipVectorStore, BlockContext augmentation on annotation hits, a new GraphQL query + types, and a frontend deep-link hook. The code is consistent, security-conscious, and ships with meaningful test coverage. A few items to address before merge:


Issues

1. Prefetch wasted in calculate_embeddings_for_relationship_batch (performance)

synthesize_relationship_block_text uses .values_list() on the M2M managers, which bypasses Django's prefetch cache (which stores model instances, not ValueQuerySet results). The Prefetch calls at the top of the batch task are therefore completely ignored, resulting in 2 extra queries per relationship in the batch. For N relationships that is 2N avoidable round-trips.

The comment in synthesize_relationship_block_text explains why values_list is used (resilience to prefetch cache shape changes), but the batch task should either drop the Prefetch lines (they are misleading as-is) or pass pre-loaded text to avoid the re-fetch. The precomputed_text parameter already exists in _embed_relationship for exactly this reason — it just is not threaded through synthesize_relationship_block_text itself.

2. Stale comment says "inlined" but the code calls _apply_dual_embedding_strategy

In calculate_embeddings_for_relationship_batch (dual-embedding branch, embeddings_task.py):

Mirrors _apply_dual_embedding_strategy but inlined because the source dict is keyed on Relationship...

The next lines immediately call _apply_dual_embedding_strategy. The comment predates the final implementation and should be updated to something like "Uses _apply_dual_embedding_strategy for the dual-embedding pass...".

3. global_search block-context path not directly tested

CoreAnnotationVectorStore.global_search now calls cls._attach_block_context_sync(reranked), but the global_search test suite does not exercise that path with a materialised subtree group present. The annotation-hit tests in test_subtree_group_vector_search.py call _attach_block_context_sync directly (good), but an actual global_search round-trip with block context would protect this classmethod path against future regressions. Low priority, but worth adding.

4. Relay ID inconsistency — documentation gap

Using raw Django PKs in relationship_id fields on BlockContextType and SemanticSearchRelationshipResultType is a deliberate, well-commented design choice. However, these fields use graphene.ID, which clients typically treat as a Relay global ID. Consider either renaming the fields to e.g. relationshipPk to make the non-Relay convention self-evident at the schema level, or adding an explicit schema description warning clients not to pass these through fromGlobalId. The inline code comments are good; a schema-level note would guard against future SDK code generation stripping those comments.


Minor / Nits

AnonymousContext.user = None in test: Works with Graphene's @login_required today, but AnonymousUser() is the Django-conventional sentinel and would be more robust if the decorator's implementation changes.

Logging style inconsistency in calculate_embeddings_for_relationship_batch: The newer code in this file mixes %s-style and f-string formatting. Worth normalising to one style.


What is Working Well

  • BaseVectorStore extraction cleanly removes duplicated embedder-resolution, user-lookup, and IDOR-check code. The patch-target updates in existing tests (base_vector_store.generate_embeddings_from_text) are correct.
  • IDOR handling is consistent: both new resolvers return [] for missing-or-denied resources, matching the permissioning guide.
  • join_block_text_parts as a shared utility correctly prevents cap/truncation logic from diverging between the embedder input and the GraphQL snippet surface.
  • _attach_block_context_sync runs after reranking so the extra DB join is bounded by the final top-k count, not the oversample pool.
  • useJumpToRelationship correctly bridges raw Django PKs to Relay global IDs via getNumericIdFromGlobalId, and the regression test explicitly pins that bridge (the btoa("Relationship:42") vs "42" case).
  • Migration and model are consistent: the partial-unique constraint appears identically in both models.py and 0073_embedding_relationship.py.
  • Test coverage for block-context logic (smallest-enclosing-group selection, truncation, IDOR denial, synthesis edge cases) is solid.

Summary

The architecture is sound and the security model is correctly applied throughout. Items 1 and 2 should be fixed before merge (wasted prefetch is misleading and the stale comment is inaccurate). Items 3 and 4 are lower priority but worth tracking. Overall this is close to ready.

JSv4 added 5 commits May 16, 2026 14:44
The generate_embeddings_from_text and agenerate_embeddings_from_text functions
are now imported and used in base_vector_store.py (BaseVectorStore parent
class), not in core_vector_stores.py. Update the @patch decorators in
TestHybridSearch to target the base module so mocks bind successfully.
Cover the new code paths surfaced by issue #1645:
- _embed_relationship: empty-text/None-vector/None-embedding short-circuits,
  precomputed_text reuse, and synthesize fallback.
- calculate_embeddings_for_relationship_batch: empty-list early return,
  explicit-embedder load failure, per-relationship outcome counting,
  dual-embedding dispatch when no embedder is supplied, and individual
  failure aggregation.
- PydanticAIVectorSearchResponse: async_from_core_results block_context
  dict-building and label-resolution branches, plus the sync variant.
Items 1 and 2 from the most recent automated review on #1669:

- synthesize_relationship_block_text uses values_list which bypasses
  Django's prefetch cache (it stores instances, not raw column tuples),
  so the prefetch_related calls in calculate_embeddings_for_relationship_batch
  were doing zero work. Drop them and document why; the 2N queries are
  bounded by the small subtree-group cardinality per batch.

- Correct the stale 'inlined' comment in the dual-embedding branch — the
  code calls _apply_dual_embedding_strategy directly; update the comment
  to describe the actual delegation.
@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Code Review: Surface OC_SUBTREE_GROUP Relationships in Semantic Search (#1645)

This is a well-architected feature. The BaseVectorStore extraction, defensive IDOR patterns, and breadth of test coverage are all strong. The issues below are roughly ordered by severity.


Performance

1. N+1 queries in _attach_block_context_sync — High

The block-context attachment loop calls hit_group.target_annotations.all() and hit_group.source_annotations.all() inside a for loop without prefetching, generating one DB round-trip per subtree group hit. The relationship queryset should have .prefetch_related("source_annotations", "target_annotations") applied before the loop.

# core_vector_stores.py – add to the Relationship queryset before the loop
qs = qs.prefetch_related("source_annotations", "target_annotations")

2. N+1 in synthesize_relationship_block_text when called in a batch loop — Medium

synthesize_relationship_block_text() issues two values_list() queries per call (source + target text). In calculate_embeddings_for_relationship_batch() this is called once per relationship in the batch, which is an O(N) query pattern. The comment notes "values_list keeps the helper resilient to prefetch" — but values_list() bypasses the prefetch cache, so prefetching wouldn't help anyway. For small batches (≤100 relationships) this is acceptable; for larger ones a bulk text map would avoid the fan-out. Worth at least a docstring note on expected batch sizes.


Security / IDOR

3. Block-context attachment does not re-apply corpus/document scope — Medium

_attach_block_context_sync filters relationships via visible_to_user(user) but does not re-apply the corpus/document scope from the originating search. Consider an edge case: annotation A is in corpus X (user has access) and is also a target annotation of a subtree group that lives in corpus Y (user has no access). The annotation hit is legitimate, but surfacing corpus Y's subtree group as its block context could leak structure the user shouldn't see.

Suggested fix — carry the scope into the attachment helper:

if self.corpus_id:
    qs = qs.filter(corpus_id=self.corpus_id)
if self.document_id:
    qs = qs.filter(document_id=self.document_id)

4. No explicit is_authenticated guard in semantic_search_relationships resolver — Low

The resolver relies on @login_required to gate anonymous access. If that decorator were accidentally dropped or imported incorrectly, visible_to_user(user) would be called with request.user = AnonymousUser. Consider adding an explicit guard at the top of the resolver body as defence-in-depth (consistent with how other sensitive resolvers are structured):

user = info.context.user
if not user or not user.is_authenticated:
    return []

5. Frontend rel= URL parameter not validated — Low

CentralRouteManager.tsx passes searchParams.get("rel") directly to selectedRelationshipId(). The downstream parseInt(relId, 10) in useJumpToRelationship.ts will produce NaN for non-numeric strings, which is safe today, but explicit validation at the entry point is more robust:

const relParam = searchParams.get("rel");
const relationshipId = relParam && /^\d+$/.test(relParam) ? relParam : null;
selectedRelationshipId(relationshipId);

Consistency / Race Condition

6. Embedding task dispatch outside the atomic block — Low

In subtree_groups.py, _dispatch_relationship_embeddings() is called outside the atomic() block. The comment explains the intent (broker failure shouldn't roll back the rows), but there's an under-documented gap: if the materialiser re-runs after a partial failure, the delete-then-insert creates new rows with fresh PKs, which may orphan embeddings for old PKs. The idempotent upsert in add_embedding() covers the duplicate-enqueue case, but not the stale-PKs-after-reimport case. A fallback background task or a periodic "embed unembedded relationships" scan would close this gap.


Test Coverage

7. No async integration test for CoreRelationshipVectorStore.async_search() — Low

test_subtree_group_vector_search.py tests async_search() via mocking, which validates the wrapper contract but not the actual DB path. TransactionTestCase with async_to_sync (or django.test.IsolatedAsyncioTestCase) would let an integration test hit the real queryset. The annotation store has equivalent async integration tests; parity here would be good.

8. No coverage for empty source-annotation edge case — Low

synthesize_relationship_block_text() handles empty raw_text strings but there is no test for a Relationship that has zero source annotations. The materialiser guarantees at least one, but a malformed row (or future schema change) hitting this path would silently produce block text that consists only of targets. A dedicated test would pin the contract:

def test_synthesize_with_no_source_annotations(self):
    rel.source_annotations.clear()
    text = synthesize_relationship_block_text(rel)
    self.assertEqual(text, "")  # or assert it equals only target text

Minor / Code Quality

9. user_id type coercion not centralised — Minor

BaseVectorStore.__init__ accepts user_id: str | int | None. Downstream code calls User.objects.get(id=self.user_id) — if a string is passed, Django will coerce it via the PK type, but a non-numeric string raises ValueError rather than User.DoesNotExist, which is not caught by the except ObjectDoesNotExist handler. Coercing in __init__ is safer:

self.user_id = int(user_id) if isinstance(user_id, str) and user_id else user_id

Positives Worth Noting

  • The BaseVectorStore abstraction cleanly eliminates the duplication between annotation and relationship stores.
  • Partial unique constraint on Embedding.relationship in the migration is exactly right for polymorphic storage.
  • Defensive empty-list returns for both "not found" and "denied" cases prevent IDOR enumeration.
  • join_block_text_parts() being shared between the embedder and the block-context renderer guarantees consistency between what's indexed and what's surfaced — good design.
  • Test suite is thorough (11+ subtree group search tests, full GraphQL round-trip tests, frontend unit tests for the deep-link hook).
  • CHANGELOG entry is detailed and well-formatted.

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Code Review: PR #1669 — Surface OC_SUBTREE_GROUP relationships in semantic search

Overall this is a well-designed, comprehensive implementation. The end-to-end wiring from materialised subtree groups through to vector search and frontend deep-linking is coherent, and the design decisions are sound. A few things to call out:


Strengths

  • Idempotent embedding pipeline: Dispatching calculate_embeddings_for_relationship_batch outside the atomic block in build_subtree_groups_for_document is correct — a failed enqueue won't roll back the relationship rows, and the upsert logic short-circuits unchanged inputs on retry. Good design.
  • Embedding/display consistency: synthesize_relationship_block_text() is the single source of truth for both the embedding payload and the GraphQL snippet, eliminating drift between what the vector index "knows" and what clients render.
  • IDOR handling: resolve_semantic_search_relationships returns [] for both "not found" and "no permission", matching the annotation-search pattern exactly. The CoreRelationshipVectorStore._check_idor_sync() mirrors this contract.
  • Lazy subqueries: The CoreRelationshipVectorStore uses lazy subqueries for corpus document IDs rather than materialising large ID lists — important at scale.
  • _attach_block_context_sync efficiency: Single ORM join + one IN-query for annotation text after reranking, not per-result. The hit set is bounded by similarity_top_k (≤200), so the join cost is predictable.
  • Test coverage: Backend suite covers smallest-enclosing-group selection on a 4-level tree, IDOR denial, block text truncation, and the full embedding round-trip. The lastAppliedSelectionRef guard in useJumpToRelationship is also properly unit-tested.

Issues & Suggestions

1. Pagination over a pre-fetched slice — potential correctness gap

In resolve_semantic_search_relationships:

results = store.search(RelationshipVectorSearchQuery(query_text=query, similarity_top_k=limit + offset))
paginated_results = results[offset : offset + limit]

The store is told to fetch limit + offset results and then you slice client-side. This means:

  • Page 1 (offset=0, limit=50): fetches 50, returns 50. ✓
  • Page 2 (offset=50, limit=50): fetches 100, returns results[50:100]. ✓ (but fetches 2× the data)
  • Page 10 (offset=450, limit=50): fetches 500 relationship embeddings from the DB. Expensive.

This is the same pattern used by the annotation resolver, so I understand the consistency argument, but it's worth noting that deep pagination will be costly. The existing SEMANTIC_SEARCH_MAX_RESULTS cap on limit helps, but offset is unbounded. Consider documenting the expected usage pattern (i.e., only shallow pagination) or adding an offset cap.

2. Raw PKs vs Relay global IDs — document the contract explicitly

BlockContextType.relationship_id and all IDs in SemanticSearchRelationshipResultType are raw Django PKs, not Relay global IDs. The field descriptions mention this, which is good, but there's a subtle inconsistency: corpus_id and document_id in SemanticSearchRelationshipResultType are also raw PKs, while everywhere else in the GraphQL schema these are Relay-encoded. If any frontend code tries to pass these through a query that expects a Relay global ID (e.g., node(id: $documentId)), it will silently fail or return null.

Suggestion: either encode all IDs consistently (even in this type), or add a note to SemanticSearchRelationshipResultType's description mirroring what you did for BlockContextType.relationship_id.

3. _attach_block_context_sync is a static method but must be called with permission-scoped results

The method signature is:

@staticmethod
def _attach_block_context_sync(results: list["VectorSearchResult"]) -> list["VectorSearchResult"]:

It queries Relationship.objects without any user-scope filter — it trusts that the incoming results are already permission-filtered by the store. This is correct given the current call sites, but it's an implicit contract that isn't documented. If someone calls this helper on unfiltered results in the future (e.g., admin tooling, a new entry point), it could surface relationship text to the wrong user via the block_text field. A one-line docstring note ("Caller is responsible for ensuring results are already visible_to_user-filtered") would protect this.

4. useJumpToRelationship scroll retry timing

export const JUMP_TO_RELATIONSHIP_SCROLL_RETRY_MS = 300;

300ms may be too aggressive on large or slow-rendering PDFs. The retry fires once, and if the virtualised page still hasn't mounted, the scroll silently fails. The current test suite mocks the DOM so it can't catch timing-sensitive failures. Consider logging a warning when the retry fails too, so production issues are observable:

if (!element) {
  console.warn(`[useJumpToRelationship] annotation ${sourceAnnotationId} not found after retry`);
}

5. Block text truncation not surfaced in UI

block_text is capped at SUBTREE_GROUP_BLOCK_TEXT_MAX_CHARS (16,000 chars), and the GraphQL description says "no further truncation needed." But there's no indication to the user that text has been truncated. For extremely large subtree groups, the displayed snippet will simply end mid-sentence with no ellipsis or "..." indicator. This isn't a correctness issue, but it's a UX gap worth a follow-up issue.

6. Migration ordering assumption

Migration 0073_embedding_relationship.py adds the Embedding.relationship FK. The partial unique constraint name is unique_embedding_per_relationship_embedder. Double-check that this constraint name doesn't collide with any other constraint in the Embedding table — the existing partial constraints (for annotation, document, note, etc.) follow the same naming pattern, so it should be fine, but worth verifying no prior migration used that exact name.


Minor Nits

  • resolve_semantic_search_relationships has a # Lazy import keeps the GraphQL module light comment. This is a project-wide convention, which is fine, but the comment appears both here and in resolve_semantic_search — if they're always going to be used together, consider whether the lazy-import rationale still holds (or just remove the comment per CLAUDE.md conventions on not explaining what the code does).

  • In CoreRelationshipVectorStore, the _build_visible_relationship_qs docstring says "Mirrors CoreAnnotationVectorStore._build_visible_annotation_qs". If the annotation store's method is ever renamed, this docstring becomes stale. Not urgent, but worth noting.

  • The BlockContextType fields use very detailed description= strings (multi-sentence explanations). The project's CLAUDE.md convention is no comments unless the WHY is non-obvious. The "NOTE: this is the raw Django PK" part is genuinely non-obvious and worth keeping; the rest (e.g., "Useful for highlighting the block root") is more of a tutorial. Consider trimming those.


Test Coverage Gaps

  • No test for the resolve_semantic_search_relationships GraphQL resolver via the actual GraphQL schema (only unit-level). A single integration test exercising the full resolver chain (authenticated user, corpus scoping, non-empty results) would increase confidence before this hits production.
  • No frontend test for the CorpusAnnotationCards click path that includes relationshipId — the existing tests likely only cover the no-blockContext branch.

Summary

The implementation is solid. The three items worth addressing before merge are:

  1. Document the raw PK contract on SemanticSearchRelationshipResultType so future callers don't accidentally pass raw PKs to Relay-ID-expecting queries.
  2. Add a docstring to _attach_block_context_sync noting the caller's responsibility for permission-filtered inputs.
  3. Add a GraphQL-level integration test for resolve_semantic_search_relationships.

The pagination cost at large offsets is a known limitation of the existing pattern — worth a follow-up issue rather than a blocker.

JSv4 added 2 commits May 16, 2026 19:22
Resolves CHANGELOG conflict by keeping both entries (subtree-group
semantic search + PII scanner analyzer).
Codecov flagged three files on PR 1669 below patch threshold:

* useJumpToRelationship.ts (75%) — added unit tests for the scroll
  happy path, the source-ref → target-ref fallback, the deferred
  setTimeout retry path, and the malformed-relay-id catch branch.
  Also tightened test isolation: hook instances from prior tests now
  unmount in afterEach so shared scroll-spy calls don't leak across
  cases.

* CorpusAnnotationCards.tsx (30%) — extracted the blockContext
  derivation (annotation id → containing relationship id) and the
  click-handler queryParams builder into a pure
  corpusAnnotationCardsHelpers module so they can be unit-tested
  without mounting the full Apollo + router + reactive-var stack.

* CentralRouteManager.tsx (50%) — added Phase-2 tests pinning that
  ?rel=42 populates selectedRelationshipId and absence clears it.
* Document the raw-PK ID convention on SemanticSearchRelationshipResultType
  so future callers don't accidentally feed those values into resolvers
  expecting Relay global IDs (e.g. node(id: $documentId)).

* Add a SECURITY CONTRACT docstring to _attach_block_context_sync stating
  that it does not apply visible_to_user filtering on its own — callers
  must pass results already permission-filtered by the store's visibility
  queryset, or block_text could leak to users who can't see the underlying
  source/target annotations.

* Remove the 'Lazy import keeps the GraphQL module light' explainer
  comment per the project's no-explain-what conventions; the pattern is
  consistent with resolve_semantic_search and self-evident.
@claude
Copy link
Copy Markdown

claude Bot commented May 17, 2026

Code Review: Surface OC_SUBTREE_GROUP Relationships in Semantic Search (#1645)

This is a well-architected feature that correctly extends the semantic search surface to cover materialised subtree-group relationships. The permission model mirrors the existing annotation-store patterns, the migration is clean and backward-compatible, and the test coverage is thorough. A few items worth addressing before merge:


Security

_attach_block_context_sync — documented contract with no runtime guard

opencontractserver/llms/vector_stores/core_vector_stores.py lines 531-543

The method fetches Annotation.objects.filter(id__in=referenced_ann_ids) without a visible_to_user() filter (line 612-616), relying entirely on the invariant that callers pre-filter their results. The docstring documents this honestly and every current call site is downstream of the permission gate. The risk is a future caller missing the contract.

Suggested mitigation (low-effort): the method already receives the full results list — the user object is available on the store instance. Consider threading it in and asserting, or adding a user_id: int parameter that makes the security requirement explicit at the call signature level rather than in a docstring. Either way, the current approach is correct today, just fragile.


Performance

Annotation text fetch in _attach_block_context_sync uses a second unfiltered Annotation query

Lines 612-616: after picking winning groups, the method runs a new Annotation.objects.filter(id__in=referenced_ann_ids) for block-text assembly. This is correct (the IDs come from already-permission-filtered results), but the queryset fetches raw_text for all source + target annotations of every winning group — potentially more rows than strictly needed if a group has many targets not in the original hit set. The current approach is fine at typical cardinality, but worth noting for documents with very deep subtree trees.

calculate_embeddings_for_relationship_batch — 2 queries per relationship for text synthesis

opencontractserver/tasks/embeddings_task.py line 1139: synthesize_relationship_block_text fires two values_list queries (source M2M + target M2M) per relationship. The code itself documents this trade-off at lines 1078-1086, and the expected cardinality (one group per non-leaf node) keeps the cost reasonable. No action needed — just confirming the trade-off is intentional and well-understood.


Migration

0073_embedding_relationship.py — clean and correct

  • Nullable FK means no table rewrite
  • Partial unique constraint on (embedder_path, relationship) correctly scopes uniqueness to non-NULL rows only
  • CASCADE on delete is appropriate
  • No explicit standalone index needed since the unique constraint creates one
  • Rollback is safe

One minor suggestion: add a comment in the migration (or in Embedding's model docstring) explaining why the constraint is partial — it protects the existing annotation/note/message/document rows that legally carry relationship=NULL. Easy to forget during a future schema audit.


Code Quality

join_block_text_parts — good DRY extraction, minor clarity note

opencontractserver/utils/embeddings.py lines 15-53: the shared cap/truncation helper is the right call for keeping the embedder input and the GraphQL snippet identical. One subtle readability issue: the loop tracks running as a character count but the logic for the first element (if running == 0:) and subsequent elements (running + 1 + len(chunk)) uses the running total differently. Both paths are correct, but an inline comment explaining that +1 accounts for the \n separator would help future readers — particularly since the off-by-one here is easy to introduce silently.

SUBTREE_GROUP_BLOCK_TEXT_MAX_CHARS = 16_000

opencontractserver/constants/annotations.py line 78: the 16k value is intentional (roughly half the context budget of most encoder models), but there is no comment explaining why this number was chosen. Future maintainers changing or tuning the embedder may not know the constraint. A one-line comment tying it to the embedding model's context window would prevent an accidental increase that silently degrades retrieval quality.

Test mock path fixes in test_version_aware_vector_store.py

The 8 @patch target changes from core_vector_stores.generate_embeddings_from_textbase_vector_store.generate_embeddings_from_text are correct: generate_embeddings_from_text now lives in the base and is imported there, so patching the source module is the right target. Good catch.


Test Coverage

The new test_subtree_group_vector_search.py file is well-structured:

  • ✅ Smallest-enclosing-group selection on a 4-level tree
  • ✅ Root-level hit → None
  • ✅ Empty-input pass-through
  • ✅ Block-text truncation at cap
  • ✅ End-to-end relationship vector retrieval
  • ✅ IDOR denial for unauthorized users
  • synthesize_relationship_block_text determinism, cap enforcement, empty-component skipping

Two coverage gaps:

  1. async_search path in CoreRelationshipVectorStore has no dedicated test. The embedding-regeneration path (query_text supplied, no query_embedding) delegates to sync_to_async(self.search) — the happy path is covered implicitly by the resolver test, but the regeneration branch is not explicitly exercised for relationships (compare test_async_search_regenerates_query_embedding_for_text_only in the annotation store tests, which covers this for annotations).

  2. Non-structural OC_SUBTREE_GROUP relationship filtered out: the structural=True filter in _attach_block_context_sync (line 561) correctly blocks analyzer-copied label rows from polluting block context, but no test verifies this boundary. It's a low-probability attack vector today and easy to add later.


Frontend

useJumpToRelationship — correct and robust

The dual lastAppliedSelectionRef / lastAppliedScrollRef pattern correctly handles the virtualized renderer's async page materialisation. The one-shot retry via window.setTimeout is the right pragmatic fallback.

Nit: the cleanup return () => window.clearTimeout(handle) at the end of the effect body is only returned in the retry branch. If the hook's dependencies change (e.g. relId changes while the timer is pending), React will call the cleanup. This is already correct — but it's worth noting the asymmetry: the success path returns undefined implicitly, while the retry path returns a cleanup function. No bug here, but it's a subtle pattern that could confuse a reader.

Raw PK vs Relay ID in deep-link URL

The convention that ?rel=<raw_pk> carries a raw Django integer (not a Relay global ID) is documented in social_types.py and handled correctly by parseInt + getNumericIdFromGlobalId on the consumer side. Worth ensuring this is also called out in the routing-system docs (docs/frontend/routing_system.md) since it's a deliberate divergence from how ?ann= works.


Summary

Overall this is high-quality work. The two most actionable items before merge are:

  1. Add an async relationship vector search test (mirrors the existing annotation store test) — low effort, closes the one visible coverage gap.
  2. Add a one-line comment on SUBTREE_GROUP_BLOCK_TEXT_MAX_CHARS explaining the 16k choice so future embedder changes don't silently regress retrieval.

The security contract in _attach_block_context_sync is correct today and thoroughly documented; the suggestion to make it structurally enforced is a follow-up item, not a blocker.

@JSv4 JSv4 linked an issue May 17, 2026 that may be closed by this pull request
JSv4 added 2 commits May 16, 2026 23:12
…ral decoy

Migration 0073: add inline comment explaining the partial unique constraint —
the condition (relationship IS NOT NULL) protects existing
annotation/note/message/document rows that legally carry relationship=NULL.

Test: add coverage for the structural=True filter in
_attach_block_context_sync. A non-structural OC_SUBTREE_GROUP look-alike must
not contribute block context even when no genuine structural group is
available — guards against analyzer-copied label rows polluting embedded
context.
@claude
Copy link
Copy Markdown

claude Bot commented May 17, 2026

Code Review — PR #1669: Surface OC_SUBTREE_GROUP Relationships in Semantic Search

This is a well-architected end-to-end feature with strong separation of concerns, correct IDOR handling, solid test coverage, and clean migration safety. The key concerns are two correctness issues in the embedding pipeline that warrant attention before merge.


High Priority

B1 — store_embedding validator: truthy check vs. None check, and breaking API contract

File: opencontractserver/shared/Managers.py

The validator was changed from "at least one" to "exactly one" (len(provided) != 1), but the "provided" check uses if x (truthiness), not if x is not None. Django PKs start at 1, so id=0 is practically impossible — but the subtle inconsistency is a latent hazard. More importantly, this is a breaking behavioral change to an internal API: any caller previously passing multiple IDs (annotation + document, for instance) that was previously allowed will now silently raise. All call sites need an explicit audit. In the PR diff, only the new _embed_relationship path calls add_embedding() with just relationship_id, which is correct — but confirm no existing tests regressed.

Suggestion: Use x is not None instead of if x for all slots, and document this as a breaking API change in the changelog.


B2 — Dual-embedding strategy is dead code for relationships

File: opencontractserver/utils/subtree_groups.py, opencontractserver/tasks/embeddings_task.py

_dispatch_relationship_embeddings always passes an explicit embedder_path to calculate_embeddings_for_relationship_batch. Inside the task, when embedder_path is not None, it takes the explicit-embedder branch and skips _apply_dual_embedding_strategy entirely. The corpus_id argument is still passed but has no effect in this branch.

This means the dual-embedding strategy is never exercised for relationship embeddings — it exists in the task but is unreachable from the dispatch function. The test test_dual_embedding_path_invoked_when_no_explicit_embedder does verify the dual path when embedder_path=None, but _dispatch_relationship_embeddings always supplies an explicit one.

Suggestion: Either (a) have the dispatch function pass embedder_path=None for the default embedder slot and let the task invoke the dual strategy, matching the annotation embedding design intent; or (b) remove _apply_dual_embedding_strategy from the relationship task and document that relationships intentionally use single-embedder dispatch. As written, there's a semantic gap between the design comment and the actual execution path.


Medium Priority

S1 — _attach_block_context_sync security invariant is documentation-only

File: opencontractserver/llms/vector_stores/core_vector_stores.py

The docstring correctly states "Callers MUST pass results that have already been permission-filtered." But _attach_block_context_sync is a @staticmethod with no mechanism to enforce this. A future developer calling it on unfiltered annotation hits could inadvertently expose block_text from inaccessible structural groups.

Suggestion: Consider asserting that results came through the standard retrieval pipeline, or accept the user parameter and re-run a lightweight filter check (Relationship.objects.visible_to_user(user).filter(pk__in=group_pks).values_list("pk", flat=True)) as a guard. At minimum, elevate the docstring warning to a more prominent comment near the function signature.


Low Priority / Nits

Q1 — DRY: block_context dict duplicated across sync/async paths
frontend/src/... is fine, but in opencontractserver/llms/vector_stores/pydantic_ai_vector_stores.py, the {"relationship_id": ..., "source_annotation_id": ..., ...} dict is constructed identically in from_core_results and async_from_core_results. Extract a _block_context_to_dict(bc: BlockContext) -> dict helper.

Q2 — corpus_id param to calculate_embeddings_for_relationship_batch is dead when embedder_path is set
See B2. The corpus_id kwarg is accepted and passed but has no effect in the explicit-embedder branch. This should be either used or removed to avoid confusion.

B4 — Implicit ID ordering assumption in test
test_synthesize_joins_source_and_targets_in_id_order asserts "HEAD\nT1\nT2" and relies on t1.id < t2.id. This holds in practice due to sequential auto-increment, but a brief inline comment noting the ordering dependency would help future readers.

S4 — test_anonymous_rejected uses user=None
The test passes user=None to exercise the anonymous rejection path. Confirm the Graphene @login_required decorator in this codebase checks user is None rather than not user.is_authenticated, or the test may not be exercising the real guard.

Q6 — Missing __all__ on core_relationship_vector_store.py
base_vector_store.py defines __all__; the new file does not. Minor inconsistency.


What's Working Well

  • IDOR handling in resolve_semantic_search_relationships is correct — returns [] for both "not found" and "no permission" to prevent enumeration. ✓
  • Migration 0073 is safe: nullable FK + partial unique constraint; no data migration needed; on_delete=CASCADE matches annotation embedding pattern. ✓
  • Test coverage is thorough: block-context selection, IDOR denial, embedding edge cases (empty text, None vector), frontend hook retry/fallback/one-shot guard, GraphQL resolver scoping — all covered. The only notable gap is global_search with subtree groups.
  • block_text capping via SUBTREE_GROUP_BLOCK_TEXT_MAX_CHARS in constants/annotations.py follows the no-magic-numbers convention. ✓
  • join_block_text_parts correctly handles truncation at the character boundary without off-by-one. ✓
  • Frontend deep-link hook (useJumpToRelationship) correctly guards against re-triggering with a lastHandledRef and handles Relay-ID ↔ raw-PK translation. The 11-case test suite is comprehensive. ✓
  • Route integration for rel= is minimal and correctly wired through CentralRouteManager without bypassing the routing system. ✓

Bottom line: Fix B1 (validator semantics + call-site audit) and B2 (dual-embedding dead path), and this is ready to merge. The rest are polish items.

@claude
Copy link
Copy Markdown

claude Bot commented May 17, 2026

Code Review: Surface OC_SUBTREE_GROUP relationships in semantic search (#1645)

This is a well-structured, ambitious PR that cleanly extends the semantic search pipeline end-to-end. The BaseVectorStore extraction is a particularly good DRY improvement. Here are my findings, roughly ordered by severity.


Bugs / Correctness Issues

1. store_embedding() falsy check treats id=0 as "not provided"

opencontractserver/shared/Managers.py

provided = [x for x in (...) if x]   # ← treats 0 as not-provided

Should be if x is not None to be correct for integer IDs. While id=0 never occurs in practice with auto-increment PKs, this will silently miscount if a caller ever passes 0 by accident and should align with the intent.

2. clean() comment contradicts store_embedding() enforcement

opencontractserver/annotations/models.py

# If you want to enforce "exactly one," just check parent_references != 1

This comment says "if you want to enforce exactly one" as if it's optional, but store_embedding() now enforces exactly-one unconditionally. The comment should be removed or updated to match, and clean() itself should enforce the same constraint (right now clean() only rejects zero parents, not multiple).

3. resolve_semantic_search_relationships mixes raw PKs with the Relay-ID API surface

config/graphql/search_queries.py / config/graphql/social_types.py

relationship_id, document_id, corpus_id, source_annotation_id, and target_annotation_ids are returned as raw Django integer PKs. Every other GraphQL entity in this schema is addressed by its Relay global ID. Consumers who try to pass one of these IDs to a standard Relay node query or mutation will get a lookup failure. Either:

  • Wrap the IDs in to_global_id("RelationshipType", pk) / to_global_id("DocumentType", pk) before returning, or
  • Add a clear field-level docstring noting the departure and ensure the frontend never passes them to node resolvers.

The frontend currently uses rel=<raw_pk> in URLs (acknowledged in the comment in navigationUtils.ts), but the discrepancy between rel= and all other URL params (doc=, ann=, etc.) will be a gotcha for future contributors.


Performance Considerations

4. Extra database round-trip inside _build_visible_relationship_qs for each document-scoped search

opencontractserver/llms/vector_stores/core_relationship_vector_store.py

structural_set_id = (
    Document.objects.filter(pk=self.document_id)
    .values_list("structural_annotation_set_id", flat=True)
    .first()
)

This is a separate query that runs on every search() call when document_id is set. It can be inlined with OuterRef / a subquery into the main queryset, or the structural_annotation_set_id can be selected in the visible_qs JOIN. At the moment it's an extra round-trip before the vector scan starts.

5. Potential duplicate rows in _run_vector_search under certain failure modes

opencontractserver/llms/vector_stores/core_relationship_vector_store.py

The query JOINs Relationship → Embedding and annotates cosine distance. The partial unique constraint unique_embedding_per_relationship_embedder prevents true duplicates, but if the constraint is ever violated (e.g., a partially-applied migration or a race condition during dual-embedding), a single relationship could appear twice in the top-k result list. A .distinct() before the annotate or a Subquery approach (as CoreAnnotationVectorStore uses VectorSearchViaEmbeddingMixin) would be fully safe.

6. _attach_block_context_sync assembles block text with an IN query over all referenced annotation IDs

opencontractserver/llms/vector_stores/core_vector_stores.py

For a corpus with deep subtree groups (hundreds of annotations per group) and a full top-k result set, the IN clause over all target annotation IDs could be large. A comment noting the expected upper bound (top_k × avg group size) would help future reviewers gauge whether a pagination strategy is needed.


Code Quality / Consistency

7. social_types.py file name is misleading

config/graphql/social_types.py

BlockContextType and SemanticSearchRelationshipResultType have no relation to "social" features. This is a pre-existing naming issue, but adding more unrelated types here deepens the confusion. Consider a search_types.py companion to search_queries.py.

8. Corpus-id fallback in _shape_results is documented as a hint, not ground truth — but the GraphQL resolver treats it as ground truth

opencontractserver/llms/vector_stores/core_relationship_vector_store.py

# corpus_id is NULL on structural relationships; best-effort
# fallback to scoping context for breadcrumbs / deep-links
# (this is a hint, not ground truth — a structural set can be
# shared across corpora).

If the GraphQL client deep-links using the returned corpus_id, it will land on the corpus used to scope the search, not necessarily the corpus that owns the relationship. This is probably acceptable UX, but the GraphQL type definition and/or resolver docstring should carry the same caveat so frontend engineers don't build logic that assumes the corpus_id is authoritative.

9. clean() on Embedding still allows multiple parents

As noted in point 2, clean() checks parent_references == 0 but doesn't check > 1. Now that store_embedding() enforces exactly-one, the model-level validation is weaker than the manager-level validation. A save via Embedding.save() bypassing the manager can still create a multi-parent row that violates the spirit of the constraints.


Security

10. _attach_block_context_sync skips visible_to_user() — implicit trust documented but not enforced

opencontractserver/llms/vector_stores/core_vector_stores.py

The comment "relies on upstream permission gates" is correct: hit annotation IDs come from visible_to_user()-filtered search results, so the annotation → structural relationship join is safe. The surface area to keep in mind: if any future code path passes hit_annotation_ids from an unfiltered source, this becomes an information-disclosure vector. A defensive assert or an inline comment warning future callers not to bypass the upstream filter would make this intent explicit.


Minor / Nits

  • RelationshipVectorSearchQuery.label_texts defaults to [OC_SUBTREE_GROUP_LABEL_NAME] in the dataclass, but the resolver hardcodes it too. If a second embedded relationship label is ever added, the default would silently widen the results for all callers. Consider an explicit None default with "use all embedded labels" semantics, or remove the default and require callers to be explicit.
  • The async_search in CoreRelationshipVectorStore wraps the whole sync pipeline in sync_to_async. This is correct and the comment explains why, but it means the ORM queries inside _build_visible_relationship_qs run in a thread-pool thread rather than an async context. Worth noting that this matches the pattern already used in CoreAnnotationVectorStore.
  • corpusAnnotationCardsHelpers.ts is a solid extraction of pure functions. The unit tests are well-structured.
  • useJumpToRelationship.test.tsx — 14 test cases including the malformed-Relay-ID case and the re-selection guard. Good coverage.
  • The dual-embedding strategy for relationships correctly threads precomputed text through both embedder passes via functools.partial. Efficient and readable.

Summary

The architecture is sound and the security patterns are consistent with the rest of the codebase. The four items that should be addressed before merge are:

  1. if x is not None instead of if x in store_embedding() (correctness)
  2. Align Embedding.clean() to enforce exactly-one parent (consistency)
  3. Clarify/fix the raw-PK vs Relay-ID inconsistency in the GraphQL return types (API contract)
  4. Carry the corpus_id "hint, not ground truth" caveat through to the GraphQL type or resolver docstring (documentation)

The remaining items are improvements that can be addressed in follow-up issues.

@claude
Copy link
Copy Markdown

claude Bot commented May 17, 2026

Code Review: PR 1669 - Surface OC_SUBTREE_GROUP in semantic search. Full review text below - this is a placeholder while formatting is resolved.

@claude
Copy link
Copy Markdown

claude Bot commented May 17, 2026

Code Review: Surface OC_SUBTREE_GROUP relationships in semantic search

Overview

Well-architected, substantial feature. Thorough, security-conscious implementation with solid test coverage.

Positives

  • Clean DRY refactor: BaseVectorStore extraction eliminates duplication; updated test-patch paths confirm the move was consistent.
  • Security contract documented: The comment on _attach_block_context_sync explaining the visible_to_user omission and the caller invariant is exactly the right approach.
  • Idiomatic idempotency: Delete-then-insert + add_embedding upsert makes re-runs safe without double-embedding.
  • Test breadth: Block-context algorithm, root pass-through, truncation cap, non-structural decoy, IDOR, and Relay-vs-raw-PK regression all covered.
  • join_block_text_parts shared helper: Consistent truncation prevents silent drift between what is indexed and what is returned.

Issue 1 (Medium) — Breaking change in store_embedding

File: opencontractserver/shared/Managers.py line ~1571

Old check: if not any([...]) (zero parents -> error). New check: len(provided) != 1 (zero OR multiple parents -> ValueError). Semantically correct but a breaking change: callers that inadvertently passed two parent IDs previously got a DB constraint violation later; now they get ValueError immediately. Worth grepping all store_embedding call sites before merging.

Issue 2 (Medium) — Raw Django PKs in BlockContextType / SemanticSearchRelationshipResultType

File: config/graphql/social_types.py

relationship_id, source_annotation_id, target_annotation_ids, document_id, and corpus_id are raw Django PKs, breaking the Relay global ID convention used everywhere else. Code-gen tools type these as ID! indistinguishable from Relay-encoded IDs. A future resolver that receives one and calls from_global_id() on it will silently return wrong results (no error, just no data). Consider prefixing as relationship_pk/document_pk, encoding as Relay IDs server-side, or at minimum making the warning visible in GraphQL introspection descriptions rather than only in Python comments.

Issue 3 (Low-Medium) — _attach_block_context_sync security pre-condition is implicit

File: opencontractserver/llms/vector_stores/core_vector_stores.py

The security comment is correct today, but as new entry points are added the pre-condition is easy to forget. Consider accepting a user parameter and applying visible_to_user internally to make the gate mechanical rather than relying on caller knowledge.

Issue 4 (Low) — block_text ordering mismatch for multi-source relationships

File: opencontractserver/llms/vector_stores/core_relationship_vector_store.py, _shape_results

synthesize_relationship_block_text orders sources by ID (order_by id). _shape_results builds ordered_ids from sources[0].id from the unsorted prefetch cache. Not a bug for current single-source OC_SUBTREE_GROUP rows, but latent for future multi-source types. Fix: source_id = min(ann.id for ann in sources) if sources else None.

Issue 5 (Low) — Duplicate block_context dict building

File: opencontractserver/llms/vector_stores/pydantic_ai_vector_stores.py

Identical blocks in from_core_results and async_from_core_results both build the block_context dict. A _block_context_to_dict(bc) helper eliminates the duplication.

Issue 6 (Low/FYI) — Dual-embedding task branch unreachable from dispatcher

File: opencontractserver/utils/subtree_groups.py, _dispatch_relationship_embeddings

The dispatcher always passes an explicit embedder_path to calculate_embeddings_for_relationship_batch, so the task's dual-embedding branch (embedder_path=None) is dead code from this call site. Dual-embedding is achieved by dispatching two separate tasks instead. Functionally correct; worth a docstring note so future maintainers understand the branch exists for direct callers, not the dispatcher.

Issue 7 (Nitpick) — similarity_top_k = limit + offset vs. cap

limit is capped at SEMANTIC_SEARCH_MAX_RESULTS but similarity_top_k=limit+offset is uncapped. With a large offset the underlying vector scan can significantly exceed the declared cap. Matches the existing annotation resolver pattern; fine in practice.

Summary

The core feature is solid and the security model is sound. The two items most worth addressing before merge are the raw-PK ID convention (long-lived API footgun) and auditing the store_embedding exactly-one-parent guard for existing callers. Everything else is low-priority polish. Test coverage is strong. The BaseVectorStore extraction is a genuine improvement.

JSv4 added 3 commits May 17, 2026 09:27
- Make _attach_block_context_sync apply Relationship.visible_to_user
  internally so block-context exposure is mechanical, not caller-policed
- Sort source/target annotations by id in _shape_results so block_text
  mirrors synthesize_relationship_block_text for multi-source future
- Extract _block_context_to_dict helper to drop duplicated payload
  building between sync and async pydantic-ai response builders
- Document that calculate_embeddings_for_relationship_batch's no-path
  dual-embedding branch is for direct callers, not the dispatcher
Captures CorpusAnnotationCards browse-mode rendering — provides a stable
visual baseline for the semantic search annotations card surface that
this PR rewires for block-context deep linking.
@claude
Copy link
Copy Markdown

claude Bot commented May 17, 2026

Code Review: Surface OC_SUBTREE_GROUP Relationships in Semantic Search (#1645)

Well-executed feature PR. The architecture is clean, security is properly enforced end-to-end, and the test coverage is comprehensive. Below are observations ranging from minor clarifications to small improvement opportunities — no blocking issues found.


Security ✅

  • IDOR prevention is properly layered: CoreRelationshipVectorStore._build_visible_relationship_qs() uses Relationship.objects.visible_to_user(), resolve_semantic_search_relationships() mirrors the annotation resolver's empty-list response for both "not found" and "no permission" cases, and _attach_block_context_sync() adds a second visible_to_user() gate on the annotation-hit post-pass. Defense-in-depth is correct here.
  • Enumeration prevention: Returning an identical empty list for unauthorized vs. non-existent corpus/document IDs correctly prevents timing or error-message enumeration attacks. Good.
  • ORM usage: No raw SQL, no string interpolation in queries. Django ORM used correctly throughout.

Performance ✅

  • Block-context deferred until after reranking: _attach_block_context_sync() runs after top_k is already applied, so the extra DB round-trip is bounded. Good design decision.
  • Single IN-queries: One Annotation.objects.filter(id__in=...).values_list("id", "raw_text") for text assembly, one prefetch for source/target annotations on the relationship queryset. No N+1 issues observed.
  • Batch embedding dispatch: _dispatch_relationship_embeddings() deduplicates by (embedder_path, corpus_id) so multi-corpus documents don't enqueue duplicate tasks.

Potential Issues / Minor Observations

1. Implicit assumption: single source annotation per OC_SUBTREE_GROUP (core_vector_stores.py)

source_id = next(iter(rel.source_annotations.all()), None)

OC_SUBTREE_GROUP relationships currently always have exactly one source, but this assumption is implicit. If a malformed row ever slips through (e.g. a future materialiser bug), the code silently picks the first source with no warning. Consider adding a debug-level log or, at minimum, a comment stating the invariant. A cheap assert in tests would lock it down.

2. from_global_id() unpacking not guarded (config/graphql/search_queries.py)

_, corpus_pk = from_global_id(corpus_id)

If a caller passes a malformed or wrong-type Relay ID, from_global_id() raises binascii.Error or returns an unexpected type. The annotation resolver uses the same pattern so this is consistent, but the subsequent .filter(...).exists() fallback only saves you from a missing object — not from the decode exception itself. Wrapping in a try/except with an early return [] would be more defensive and consistent with the IDOR empty-response contract.

3. _dispatch_relationship_embeddings() — label_id not validated (utils/subtree_groups.py)

The dispatch function accepts a label_id and assumes it corresponds to OC_SUBTREE_GROUP_LABEL_NAME. If a caller ever passes a wrong label ID (e.g. copy-paste from an adjacent call site), tasks will be dispatched for unintended rows with no obvious error. A one-line guard like:

assert AnnotationLabel.objects.filter(pk=label_id, text=OC_SUBTREE_GROUP_LABEL_NAME).exists()

(or raise ValueError in production code) would catch the category of bugs this could produce.

4. Block-context dict serialization is duplicated in two places

pydantic_ai_vector_stores.py:from_core_results() and config/graphql/search_queries.py:resolve_semantic_search() both independently construct the same block_context dict from the BlockContext dataclass fields. Since BlockContext is a dataclass, the simplest consolidation is dataclasses.asdict(bc) or a to_dict() method on the dataclass. Not urgent, but worth closing before this pattern spreads.

5. Error messages in calculate_embeddings_for_relationship_batch() lack relationship ID context (tasks/embeddings_task.py)

When embedding fails, the log message uses the generic exception e without identifying which relationship failed. The annotation equivalent includes the annotation PK. Adding relationship.pk to the error log would make production debugging significantly easier.

6. Test gap: CoreRelationshipVectorStore IDOR at store level

The GraphQL resolver tests (test_subtree_group_vector_search.py) cover IDOR denial at the resolver layer. The store's own visible_to_user() gate is tested indirectly but not explicitly. A short test that creates a relationship visible only to other_user, calls store.search() as self.user, and asserts the result list is empty would give explicit coverage of the store's enforcement independent of the resolver.


Nits

  • synthesize_relationship_block_text() calls join_block_text_parts() which enforces SUBTREE_GROUP_BLOCK_TEXT_MAX_CHARS. The cap comment in the CHANGELOG says "16k chars" but the constant is 16384 — technically correct, but documenting the exact value (16384) rather than the approximation would be more precise.
  • useJumpToRelationship.ts: The comment on the hover-state set (hoveredAnnotationId) could note that hover is intentionally set only after a successful scroll (not on the first retry attempt that fails), to prevent a reader from thinking the guard is a bug.

Summary

Category Status
Security / IDOR ✅ Properly layered
N+1 / Performance ✅ No issues
Error handling ✅ Good, with minor gaps noted
Test coverage ✅ Comprehensive
DRY / SRP ✅ Mostly good (one minor duplication)
Blocking bugs ❌ None found

The feature is solid. The notes above are mostly hardening opportunities rather than correctness issues. The most actionable items are (2) guarding from_global_id() decoding and (5) adding relationship PKs to error log messages.

@JSv4 JSv4 merged commit 25db331 into main May 17, 2026
20 checks passed
@JSv4 JSv4 deleted the claude/vector-search-relationships-d6CAR branch May 17, 2026 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Surface relationships (and subtree groups) in semantic search

2 participants