Skip to content

[f] VER-267: Adaptive retrieval for searching related snippets#23

Merged
quancao-ea merged 3 commits intomainfrom
features/adaptive-retrieval-for-searching-related-snippets
Oct 6, 2025
Merged

[f] VER-267: Adaptive retrieval for searching related snippets#23
quancao-ea merged 3 commits intomainfrom
features/adaptive-retrieval-for-searching-related-snippets

Conversation

@quancao-ea
Copy link
Copy Markdown
Collaborator

@quancao-ea quancao-ea commented Oct 4, 2025

Important

Enhances search_related_snippets_public with sub-vector filtering and indexing for efficient snippet retrieval.

  • Behavior:
    • search_related_snippets_public function now uses sub-vectors for initial filtering of snippets, improving search efficiency.
    • Introduces candidate_multiplier parameter to control the number of candidates considered.
  • Functions:
    • Adds sub_vector function to extract and normalize sub-vectors from embeddings.
  • Indexes:
    • Creates snippet_embeddings_sub_vector_idx index for efficient sub-vector retrieval using HNSW.

This description was created by Ellipsis for 9aa9509. You can customize this summary. It will automatically update as commits are pushed.

Summary by CodeRabbit

  • New Features
    • Enhanced “Related Snippets” search with improved relevance and more precise matching.
  • Performance
    • Faster results thanks to vector optimizations and indexing for high-dimensional embeddings.
  • Reliability
    • More robust handling of edge cases; returns an empty result when no reference data is available instead of failing.
  • Refactor
    • Streamlined result assembly for leaner responses and reduced overhead.

@quancao-ea quancao-ea self-assigned this Oct 4, 2025
@linear
Copy link
Copy Markdown

linear Bot commented Oct 4, 2025

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello @quancao-ea, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an adaptive retrieval mechanism to enhance the efficiency and accuracy of searching for related snippets. It refines the search process by first identifying a broad set of potential candidates using a partial vector comparison, then narrowing down to the most relevant results with a full vector similarity check. This optimization aims to provide a more performant and scalable solution for finding related content by leveraging specialized indexing and a new utility function for vector manipulation.

Highlights

  • Adaptive Retrieval Strategy: Implemented a two-stage adaptive retrieval strategy within the search_related_snippets_public function. This approach first identifies a larger pool of potential candidates using a faster, partial vector comparison, and then refines the selection with a full vector similarity check.
  • Sub-vector Indexing: Introduced a new HNSW index on the snippet_embeddings table, specifically targeting 512-dimension sub-vectors. This index (snippet_embeddings_sub_vector_idx) is crucial for accelerating the initial candidate selection phase of the adaptive retrieval.
  • New sub_vector Function: Added a new PL/pgSQL function named sub_vector which is responsible for extracting a specified number of dimensions from a given vector and normalizing it. This utility function is integral to the adaptive retrieval process.
  • Performance Optimization: The overall changes are designed to significantly improve the performance and scalability of related snippet searches by optimizing the vector similarity comparison process.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Oct 4, 2025

Walkthrough

Adds a new sub-vector helper function, a corresponding HNSW index on sub-vectors, and rewrites search_related_snippets_public to use sub-vector similarity for candidate selection, then full-embedding filtering, with an added candidate_multiplier parameter and revised JSON aggregation.

Changes

Cohort / File(s) Summary
Adaptive retrieval search function update
supabase/database/sql/search_related_snippets_public.sql
Updates function signature to include candidate_multiplier; introduces sub-vector (512) usage for initial similarity; adds NULL guard; restructures CTEs; revises joins and JSON aggregation; maintains thresholding on full embedding with match_count limit.
Sub-vector helper function
supabase/database/sql/sub_vector.sql
Adds IMMUTABLE function sub_vector(extensions.vector, INT) to extract and L2-normalize the leading dimensions of a vector with validation.
Index for sub-vector similarity
supabase/database/sql/snippet_embeddings_sub_vector_idx.sql
Creates HNSW index on sub_vector(embedding, 512)::vector(512) using vector_ip_ops with m=32, ef_construction=400 to accelerate sub-vector similarity search.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant DB as Postgres
  participant F as search_related_snippets_public
  participant SE as snippet_embeddings
  participant S as snippets
  participant LS as label_summary

  Client->>DB: CALL search_related_snippets_public(snippet_id, p_language, match_threshold, match_count, candidate_multiplier)
  DB->>F: Execute function
  Note over F: Fetch source embedding and compute source_sub_embedding (512)
  alt source_embedding is NULL
    F-->>Client: Return []
  else
    F->>SE: Sub-vector similarity search (HNSW on sub_vector(...,512))
    Note over SE,F: Select top (match_count * candidate_multiplier) by sub-vector distance
    F->>SE: Filter by full embedding similarity > match_threshold<br/>Order by full distance, LIMIT match_count
    F->>S: Join snippet metadata
    F->>LS: LEFT JOIN label_summary
    F-->>Client: jsonb_agg(final_snippets)
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • nhphong

Poem

I hop through vectors, light and fleet,
Trim to 512—so swift, so neat!
First I sniff, then take a leap,
Filter deep for matches to keep.
With HNSW trails I swiftly dart—
Related snippets, nose to heart. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title clearly describes the main change—adding adaptive retrieval for searching related snippets—and aligns directly with the pull request’s purpose, making it easy for teammates to understand the primary enhancement at a glance. Including the issue reference and feature tag does not obscure the meaning but could be seen as extraneous metadata rather than part of the core description.
Linked Issues Check ✅ Passed The changes implement the adaptive retrieval mechanism defined in VER-267 by introducing a candidate_multiplier parameter, adding the sub_vector function, updating the search_related_snippets_public function to use sub-embeddings, and creating an appropriate index, thereby fulfilling the core objectives of enhanced snippet search.
Out of Scope Changes Check ✅ Passed All modifications—including the new sub_vector utility, index creation, and updates to the search_related_snippets_public function—directly support the adaptive retrieval feature outlined in VER-267, and there are no unrelated or extraneous changes present.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch features/adaptive-retrieval-for-searching-related-snippets

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@ellipsis-dev ellipsis-dev Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Changes requested ❌

Reviewed everything up to 9aa9509 in 1 minute and 24 seconds. Click for details.
  • Reviewed 151 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 2 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. supabase/database/sql/search_related_snippets_public.sql:43
  • Draft comment:
    Clarify the use of '-(embedding <#> source_embedding) > match_threshold'. Ensure that the <#> operator returns the negative inner product (or equivalent) so that negating it yields a proper similarity score. A brief comment on this assumption would aid future maintainers.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
2. supabase/database/sql/sub_vector.sql:24
  • Draft comment:
    Ensure the subvector elements maintain their original order. Modify the unnest query to also select the ordinal index and use an ORDER BY clause in the ARRAY_AGG to guarantee correct ordering.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% 1. PostgreSQL maintains order by default in array_agg without ORDER BY. 2. The unnest WITH ORDINALITY already preserves order in the CTE. 3. There's no JOIN or other operation that would disturb the order. 4. The comment starts with "Ensure that..." which is a red flag. 5. The current code already maintains order correctly. Could there be edge cases where PostgreSQL's default ordering behavior changes? Could future PostgreSQL versions change this behavior? While PostgreSQL behavior could theoretically change, we should trust the current documented behavior and not add unnecessary complexity for hypothetical future changes. The comment should be deleted as it suggests adding unnecessary complexity to solve a non-existent problem, and it starts with "Ensure that..." which violates our commenting rules.

Workflow ID: wflow_YC3kEbZoBolL9qQB

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

),
norm(factor) AS (
SELECT
sqrt(sum(pow(elem, 2)))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a check for zero-norm vectors to avoid division-by-zero errors when computing normalization.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances snippet searching by implementing a two-stage adaptive retrieval strategy using sub-vectors, which is a solid approach for improving performance. The changes are well-structured, introducing a new sub_vector function, a corresponding HNSW index, and updating the main search function. My review includes a few points to improve the implementation further: I've pointed out a potential division-by-zero error in the new sub_vector function, suggested addressing a hardcoded value for better maintainability, and raised a concern about the removal of a statement timeout which could impact database stability.

RETURN COALESCE(result, '[]'::jsonb);
END;
$$ LANGUAGE plpgsql
SET statement_timeout TO '30s';
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The SET statement_timeout TO '30s' has been removed. Vector searches can be resource-intensive, and removing the timeout could allow for long-running queries that might degrade database performance or stability. It is advisable to retain a statement timeout as a safeguard unless it is being managed at a higher level.

unnormed
)
SELECT
array_agg(u.elem / r.factor)::extensions.vector
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation does not handle the edge case where the input sub-vector has a norm of 0 (i.e., it's a zero vector). In this situation, r.factor will be 0, causing a division-by-zero error. To make the function more robust, you should handle this case. A simple fix is to use NULLIF to handle the zero norm, and COALESCE to default to a divisor of 1, which will correctly produce a zero vector.

      array_agg(u.elem / COALESCE(NULLIF(r.factor, 0), 1))::extensions.vector

Comment on lines +16 to +17
SELECT embedding, sub_vector(embedding, 512)::vector(512)
INTO source_embedding, source_sub_embedding
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The sub-vector dimension 512 is hardcoded here and also on line 37. This value is also used to define the index in snippet_embeddings_sub_vector_idx.sql. Using a magic number in multiple places makes the code harder to maintain. If you need to change the dimension, you'll have to update it in several locations. Consider declaring a variable for this dimension at the beginning of the function body to centralize it.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8faff5b and 9aa9509.

📒 Files selected for processing (3)
  • supabase/database/sql/search_related_snippets_public.sql (3 hunks)
  • supabase/database/sql/snippet_embeddings_sub_vector_idx.sql (1 hunks)
  • supabase/database/sql/sub_vector.sql (1 hunks)

Comment on lines +84 to 86
SELECT jsonb_agg(fs)
INTO result
FROM final_snippets fs;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Restore deterministic similarity order in the JSON payload.

jsonb_agg(fs) without an ORDER BY loses the similarity ordering enforced inside similar_snippets, so callers can now receive results in arbitrary order. For search this is a functional regression.

Patch suggestion:

@@
-    final_snippets AS (
+    final_snippets AS (
         SELECT
             s.id,
             s.title,
             s.file_path,
             s.recorded_at,
             s.comment_count,
             s.start_time,
             CASE
                 WHEN p_language = 'spanish' THEN s.summary ->> 'spanish'
                 ELSE s.summary ->> 'english'
             END AS summary,
             a.radio_station_name,
             a.radio_station_code,
             a.location_state,
-            COALESCE(ls.labels, '[]'::jsonb) AS labels
+            COALESCE(ls.labels, '[]'::jsonb) AS labels,
+            -(s.embedding <#> source_embedding) AS similarity
         FROM similar_snippets s
         JOIN audio_files a ON a.id = s.audio_file
         LEFT JOIN label_summary ls ON ls.snippet = s.id
     )
-    SELECT jsonb_agg(fs)
+    SELECT jsonb_agg(to_jsonb(fs) - 'similarity' ORDER BY fs.similarity DESC)
     INTO result
     FROM final_snippets fs;

This keeps the payload identical while guaranteeing the top-N remain sorted by their full-embedding similarity.

🤖 Prompt for AI Agents
In supabase/database/sql/search_related_snippets_public.sql around lines 84 to
86, jsonb_agg(fs) discards the similarity ordering enforced earlier and can
return results in arbitrary order; change the aggregation to preserve order (for
example aggregate from an ordered subquery or use jsonb_agg(fs ORDER BY
similarity DESC)) so the JSON payload keeps the top-N sorted by full-embedding
similarity while leaving the payload shape unchanged.

@quancao-ea quancao-ea requested a review from nhphong October 6, 2025 02:26
@quancao-ea quancao-ea merged commit 6abed01 into main Oct 6, 2025
2 checks passed
@quancao-ea quancao-ea deleted the features/adaptive-retrieval-for-searching-related-snippets branch October 24, 2025 09:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants