[f] VER-267: Adaptive retrieval for searching related snippets by quancao-ea · Pull Request #23 · PublicDataWorks/verdad

quancao-ea · 2025-10-04T11:58:00Z

Important

Enhances search_related_snippets_public with sub-vector filtering and indexing for efficient snippet retrieval.

Behavior:
- search_related_snippets_public function now uses sub-vectors for initial filtering of snippets, improving search efficiency.
- Introduces candidate_multiplier parameter to control the number of candidates considered.
Functions:
- Adds sub_vector function to extract and normalize sub-vectors from embeddings.
Indexes:
- Creates snippet_embeddings_sub_vector_idx index for efficient sub-vector retrieval using HNSW.

This description was created by Ellipsis for 9aa9509. You can customize this summary. It will automatically update as commits are pushed.

Summary by CodeRabbit

New Features
- Enhanced “Related Snippets” search with improved relevance and more precise matching.
Performance
- Faster results thanks to vector optimizations and indexing for high-dimensional embeddings.
Reliability
- More robust handling of edge cases; returns an empty result when no reference data is available instead of failing.
Refactor
- Streamlined result assembly for leaner responses and reduced overhead.

linear · 2025-10-04T11:58:02Z

VER-267 Implement Adaptive Retrieval for searching related snippets

gemini-code-assist · 2025-10-04T11:58:15Z

Summary of Changes

Hello @quancao-ea, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an adaptive retrieval mechanism to enhance the efficiency and accuracy of searching for related snippets. It refines the search process by first identifying a broad set of potential candidates using a partial vector comparison, then narrowing down to the most relevant results with a full vector similarity check. This optimization aims to provide a more performant and scalable solution for finding related content by leveraging specialized indexing and a new utility function for vector manipulation.

Highlights

Adaptive Retrieval Strategy: Implemented a two-stage adaptive retrieval strategy within the search_related_snippets_public function. This approach first identifies a larger pool of potential candidates using a faster, partial vector comparison, and then refines the selection with a full vector similarity check.
Sub-vector Indexing: Introduced a new HNSW index on the snippet_embeddings table, specifically targeting 512-dimension sub-vectors. This index (snippet_embeddings_sub_vector_idx) is crucial for accelerating the initial candidate selection phase of the adaptive retrieval.
New sub_vector Function: Added a new PL/pgSQL function named sub_vector which is responsible for extracting a specified number of dimensions from a given vector and normalizing it. This utility function is integral to the adaptive retrieval process.
Performance Optimization: The overall changes are designed to significantly improve the performance and scalability of related snippet searches by optimizing the vector similarity comparison process.
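The two-stage strategy in the highlights above can be modeled outside the database. What follows is a minimal Python sketch, not the PR's SQL. The names `sub_vector`, `dot`, and `related_snippets`, the in-memory dict of embeddings, and the toy 2-dimension prefix are all assumptions made for illustration. The PR itself uses 512 leading dimensions and an HNSW index for stage one, which this brute-force version does not reproduce.

```python
import math

def sub_vector(vec, dims):
    """Keep the leading `dims` components and L2-normalize them."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else list(head)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def related_snippets(source_id, embeddings, match_threshold, match_count,
                     candidate_multiplier, dims=2):
    """Two-stage search over a dict of {snippet_id: unit-length embedding}."""
    source = embeddings.get(source_id)
    if source is None:
        # Mirrors the NULL guard: no reference embedding, empty result.
        return []
    source_sub = sub_vector(source, dims)

    # Stage 1: rank the other snippets by sub-vector similarity and keep
    # match_count * candidate_multiplier candidates.
    others = [(sid, emb) for sid, emb in embeddings.items() if sid != source_id]
    others.sort(key=lambda item: dot(sub_vector(item[1], dims), source_sub),
                reverse=True)
    candidates = others[: match_count * candidate_multiplier]

    # Stage 2: score candidates on the full embedding, apply the threshold,
    # and keep the best match_count.
    scored = [(sid, dot(emb, source)) for sid, emb in candidates]
    scored = [(sid, sim) for sid, sim in scored if sim > match_threshold]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:match_count]

# Toy data: unit-length 4-dimension embeddings (hypothetical values).
embeddings = {
    "a": [1.0, 0.0, 0.0, 0.0],
    "b": [0.6, 0.8, 0.0, 0.0],
    "c": [0.0, 0.0, 1.0, 0.0],
    "d": [0.8, 0.6, 0.0, 0.0],
}
top = related_snippets("a", embeddings, match_threshold=0.5,
                       match_count=1, candidate_multiplier=2)
```

A larger `candidate_multiplier` widens the stage-one pool, which trades some speed for a lower chance that a good full-embedding match is dropped before stage two.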


Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

coderabbitai · 2025-10-04T11:58:30Z

Walkthrough

Adds a new sub-vector helper function, a corresponding HNSW index on sub-vectors, and rewrites search_related_snippets_public to use sub-vector similarity for candidate selection, then full-embedding filtering, with an added candidate_multiplier parameter and revised JSON aggregation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Adaptive retrieval search function update `supabase/database/sql/search_related_snippets_public.sql` | Updates function signature to include candidate_multiplier; introduces sub-vector (512) usage for initial similarity; adds NULL guard; restructures CTEs; revises joins and JSON aggregation; maintains thresholding on full embedding with match_count limit. |
| Sub-vector helper function `supabase/database/sql/sub_vector.sql` | Adds IMMUTABLE function sub_vector(extensions.vector, INT) to extract and L2-normalize the leading dimensions of a vector with validation. |
| Index for sub-vector similarity `supabase/database/sql/snippet_embeddings_sub_vector_idx.sql` | Creates HNSW index on sub_vector(embedding, 512)::vector(512) using vector_ip_ops with m=32, ef_construction=400 to accelerate sub-vector similarity search. |
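One reason the index can use inner-product ops on the sub-vectors is that they are L2-normalized first: for unit-length vectors the inner product equals the cosine similarity of the truncated vectors. The Python sketch below illustrates that property under stated assumptions. The validation rules (rejecting `dims` below 1 or above the vector length) are guesses, since the summary only says the function validates its input, and the function names are hypothetical stand-ins for the SQL.

```python
import math

def sub_vector(vec, dims):
    """Return the first `dims` components of `vec`, scaled to unit L2 length."""
    # Assumed validation; the PR's actual checks are not shown on this page.
    if dims < 1 or dims > len(vec):
        raise ValueError(f"dims must be between 1 and {len(vec)}, got {dims}")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    # A zero-norm prefix is not handled in this sketch and would raise
    # ZeroDivisionError.
    return [x / norm for x in head]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

a = [3.0, 4.0, 12.0]
b = [4.0, 3.0, -7.0]

# Inner product of the normalized 2-dimension prefixes ...
ip = inner_product(sub_vector(a, 2), sub_vector(b, 2))
# ... matches the cosine similarity of the raw 2-dimension prefixes.
cos = cosine_similarity(a[:2], b[:2])
```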

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant DB as Postgres
  participant F as search_related_snippets_public
  participant SE as snippet_embeddings
  participant S as snippets
  participant LS as label_summary

  Client->>DB: CALL search_related_snippets_public(snippet_id, p_language, match_threshold, match_count, candidate_multiplier)
  DB->>F: Execute function
  Note over F: Fetch source embedding and compute source_sub_embedding (512)
  alt source_embedding is NULL
    F-->>Client: Return []
  else
    F->>SE: Sub-vector similarity search (HNSW on sub_vector(...,512))
    Note over SE,F: Select top (match_count * candidate_multiplier) by sub-vector distance
    F->>SE: Filter by full embedding similarity > match_threshold<br/>Order by full distance, LIMIT match_count
    F->>S: Join snippet metadata
    F->>LS: LEFT JOIN label_summary
    F-->>Client: jsonb_agg(final_snippets)
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

VER-265 - Optimize fetch related snippets #21 — Refactors the same function’s CTE structure and final JSON assembly, overlapping with this rewrite.
[f] VER-261 - Optimize get snippets with labels #18 — Touches search_related_snippets_public; this PR’s updates intersect directly with its removal/modification.
VER-254 - Related snippet for public view #15 — Prior modifications to embedding-based similarity in the same function, related to introducing sub-vector handling.

Suggested reviewers

nhphong

Poem

I hop through vectors, light and fleet,
Trim to 512—so swift, so neat!
First I sniff, then take a leap,
Filter deep for matches to keep.
With HNSW trails I swiftly dart—
Related snippets, nose to heart. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly describes the main change (adding adaptive retrieval for searching related snippets) and aligns directly with the pull request’s purpose, making it easy for teammates to understand the primary enhancement at a glance. Including the issue reference and feature tag does not obscure the meaning but could be seen as extraneous metadata rather than part of the core description. |
| Linked Issues Check | ✅ Passed | The changes implement the adaptive retrieval mechanism defined in VER-267 by introducing a candidate_multiplier parameter, adding the sub_vector function, updating the search_related_snippets_public function to use sub-embeddings, and creating an appropriate index, thereby fulfilling the core objectives of enhanced snippet search. |
| Out of Scope Changes Check | ✅ Passed | All modifications, including the new sub_vector utility, index creation, and updates to the search_related_snippets_public function, directly support the adaptive retrieval feature outlined in VER-267, and there are no unrelated or extraneous changes present. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |


ellipsis-dev

Caution

Changes requested ❌

Reviewed everything up to 9aa9509 in 1 minute and 24 seconds.

Reviewed 151 lines of code in 3 files
Skipped 0 files when reviewing.
Skipped posting 2 draft comments. View those below.

1. supabase/database/sql/search_related_snippets_public.sql:43

Draft comment:
Clarify the use of '-(embedding <#> source_embedding) > match_threshold'. Ensure that the <#> operator returns the negative inner product (or equivalent) so that negating it yields a proper similarity score. A brief comment on this assumption would aid future maintainers.
Reason this comment was not posted:
Comment looked like it was already resolved.
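For context on the draft comment above: pgvector's `<#>` operator returns the negative inner product, so ascending order puts the most similar vector first, and negating the result recovers a similarity score. The Python sketch below models that behavior; `neg_inner_product` and `passes_threshold` are hypothetical stand-ins, not functions from the PR.

```python
def neg_inner_product(a, b):
    """Python stand-in for pgvector's `<#>` operator (negative inner product)."""
    return -sum(x * y for x, y in zip(a, b))

def passes_threshold(embedding, source_embedding, match_threshold):
    # Same shape as the SQL predicate:
    #   -(embedding <#> source_embedding) > match_threshold
    return -neg_inner_product(embedding, source_embedding) > match_threshold

source = [1.0, 0.0]
close = [0.8, 0.6]
far = [0.0, 1.0]

# Ascending order on the operator value ranks the closest vector first.
ranked = sorted([far, close], key=lambda v: neg_inner_product(v, source))
```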

2. supabase/database/sql/sub_vector.sql:24

Draft comment:
Ensure the subvector elements maintain their original order. Modify the unnest query to also select the ordinal index and use an ORDER BY clause in the ARRAY_AGG to guarantee correct ordering.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%
1. PostgreSQL maintains order by default in array_agg without ORDER BY.
2. The unnest WITH ORDINALITY already preserves order in the CTE.
3. There's no JOIN or other operation that would disturb the order.
4. The comment starts with "Ensure that..." which is a red flag.
5. The current code already maintains order correctly.
Could there be edge cases where PostgreSQL's default ordering behavior changes? Could future PostgreSQL versions change this behavior?
While PostgreSQL behavior could theoretically change, we should trust the current documented behavior and not add unnecessary complexity for hypothetical future changes.
The comment should be deleted as it suggests adding unnecessary complexity to solve a non-existent problem, and it starts with "Ensure that..." which violates our commenting rules.

Workflow ID: wflow_YC3kEbZoBolL9qQB


ellipsis-dev · 2025-10-04T11:59:27Z

+    ),
+    norm(factor) AS (
+      SELECT
+        sqrt(sum(pow(elem, 2)))


Add a check for zero-norm vectors to avoid division-by-zero errors when computing normalization.

gemini-code-assist

Code Review

This pull request enhances snippet searching by implementing a two-stage adaptive retrieval strategy using sub-vectors, which is a solid approach for improving performance. The changes are well-structured, introducing a new sub_vector function, a corresponding HNSW index, and updating the main search function. My review includes a few points to improve the implementation further: I've pointed out a potential division-by-zero error in the new sub_vector function, suggested addressing a hardcoded value for better maintainability, and raised a concern about the removal of a statement timeout which could impact database stability.

gemini-code-assist · 2025-10-04T11:59:32Z

    RETURN COALESCE(result, '[]'::jsonb);
 END;
-$$ LANGUAGE plpgsql
-SET statement_timeout TO '30s';


The SET statement_timeout TO '30s' has been removed. Vector searches can be resource-intensive, and removing the timeout could allow for long-running queries that might degrade database performance or stability. It is advisable to retain a statement timeout as a safeguard unless it is being managed at a higher level.

gemini-code-assist · 2025-10-04T11:59:33Z

+        unnormed
+    )
+    SELECT
+      array_agg(u.elem / r.factor)::extensions.vector


The current implementation does not handle the edge case where the input sub-vector has a norm of 0 (i.e., it's a zero vector). In this situation, r.factor will be 0, causing a division-by-zero error. To make the function more robust, you should handle this case. A simple fix is to use NULLIF to handle the zero norm, and COALESCE to default to a divisor of 1, which will correctly produce a zero vector.

array_agg(u.elem / COALESCE(NULLIF(r.factor, 0), 1))::extensions.vector
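The effect of the suggested `COALESCE(NULLIF(r.factor, 0), 1)` divisor can be sketched in Python. This is an illustrative model only; `normalize` is a hypothetical name, and the real function operates on `extensions.vector` values inside PostgreSQL.

```python
import math

def normalize(elems):
    """L2-normalize, dividing by 1 instead of 0 when the norm is zero."""
    factor = math.sqrt(sum(e * e for e in elems))
    # Equivalent of COALESCE(NULLIF(factor, 0), 1): fall back to 1 on a zero norm.
    divisor = factor if factor != 0 else 1.0
    return [e / divisor for e in elems]

zero = normalize([0.0, 0.0])   # stays a zero vector instead of raising
unit = normalize([3.0, 4.0])   # ordinary case: scaled to unit length
```

With the guard in place, a zero sub-vector passes through unchanged, so its inner product with any query is 0 and it simply ranks low rather than failing the whole query.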

gemini-code-assist · 2025-10-04T11:59:33Z

+    SELECT embedding, sub_vector(embedding, 512)::vector(512)
+    INTO source_embedding, source_sub_embedding


The sub-vector dimension 512 is hardcoded here and also on line 37. This value is also used to define the index in snippet_embeddings_sub_vector_idx.sql. Using a magic number in multiple places makes the code harder to maintain. If you need to change the dimension, you'll have to update it in several locations. Consider declaring a variable for this dimension at the beginning of the function body to centralize it.

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8faff5b and 9aa9509.

📒 Files selected for processing (3)

supabase/database/sql/search_related_snippets_public.sql (3 hunks)
supabase/database/sql/snippet_embeddings_sub_vector_idx.sql (1 hunks)
supabase/database/sql/sub_vector.sql (1 hunks)

coderabbitai · 2025-10-04T12:03:45Z

+    SELECT jsonb_agg(fs)
+    INTO result
    FROM final_snippets fs;


⚠️ Potential issue | 🟠 Major

Restore deterministic similarity order in the JSON payload.

jsonb_agg(fs) without an ORDER BY loses the similarity ordering enforced inside similar_snippets, so callers can now receive results in arbitrary order. For search this is a functional regression.

Patch suggestion:

@@
-    final_snippets AS (
+    final_snippets AS (
         SELECT
             s.id,
             s.title,
             s.file_path,
             s.recorded_at,
             s.comment_count,
             s.start_time,
             CASE
                 WHEN p_language = 'spanish' THEN s.summary ->> 'spanish'
                 ELSE s.summary ->> 'english'
             END AS summary,
             a.radio_station_name,
             a.radio_station_code,
             a.location_state,
-            COALESCE(ls.labels, '[]'::jsonb) AS labels
+            COALESCE(ls.labels, '[]'::jsonb) AS labels,
+            -(s.embedding <#> source_embedding) AS similarity
         FROM similar_snippets s
         JOIN audio_files a ON a.id = s.audio_file
         LEFT JOIN label_summary ls ON ls.snippet = s.id
     )
-    SELECT jsonb_agg(fs)
+    SELECT jsonb_agg(to_jsonb(fs) - 'similarity' ORDER BY fs.similarity DESC)
     INTO result
     FROM final_snippets fs;

This keeps the payload identical while guaranteeing the top-N remain sorted by their full-embedding similarity.

🤖 Prompt for AI Agents

In supabase/database/sql/search_related_snippets_public.sql around lines 84 to 86, jsonb_agg(fs) discards the similarity ordering enforced earlier and can return results in arbitrary order; change the aggregation to preserve order (for example aggregate from an ordered subquery or use jsonb_agg(fs ORDER BY similarity DESC)) so the JSON payload keeps the top-N sorted by full-embedding similarity while leaving the payload shape unchanged.
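The idea behind the suggested aggregation (sort by a helper `similarity` column, then strip it so the payload shape is unchanged) can be sketched in Python. The row contents and the `aggregate_ordered` name are made up for illustration; only the ordering-then-dropping pattern reflects the suggestion above.

```python
def aggregate_ordered(rows):
    """Sort rows by similarity (descending) and drop the helper key from each."""
    ordered = sorted(rows, key=lambda row: row["similarity"], reverse=True)
    return [
        {key: value for key, value in row.items() if key != "similarity"}
        for row in ordered
    ]

# Hypothetical rows arriving in arbitrary order, as an unordered aggregate may see them.
rows = [
    {"id": "s2", "title": "second", "similarity": 0.71},
    {"id": "s1", "title": "first", "similarity": 0.93},
    {"id": "s3", "title": "third", "similarity": 0.55},
]
payload = aggregate_ordered(rows)
```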

quancao-ea added 3 commits October 4, 2025 15:26

Add sub_vector SQL function to reduce vector dimensions

b846058

Create HNSW index for snippets embedding on lower dimensions

985cfef

Implement adaptive retrieval for searching related snippets

9aa9509

quancao-ea self-assigned this Oct 4, 2025

ellipsis-dev Bot reviewed Oct 4, 2025

gemini-code-assist Bot reviewed Oct 4, 2025

coderabbitai Bot reviewed Oct 4, 2025

quancao-ea requested a review from nhphong October 6, 2025 02:26

nhphong approved these changes Oct 6, 2025

quancao-ea merged commit 6abed01 into main Oct 6, 2025
2 checks passed

quancao-ea deleted the features/adaptive-retrieval-for-searching-related-snippets branch October 24, 2025 09:23

coderabbitai Bot mentioned this pull request Jan 15, 2026

fix: resolve search timeout by using HNSW index #54

Open

3 tasks
