[f] VER-267: Adaptive retrieval for searching related snippets#23
Conversation
Summary of ChangesHello @quancao-ea, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces an adaptive retrieval mechanism to enhance the efficiency and accuracy of searching for related snippets. It refines the search process by first identifying a broad set of potential candidates using a partial vector comparison, then narrowing down to the most relevant results with a full vector similarity check. This optimization aims to provide a more performant and scalable solution for finding related content by leveraging specialized indexing and a new utility function for vector manipulation. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
WalkthroughAdds a new sub-vector helper function, a corresponding HNSW index on sub-vectors, and rewrites search_related_snippets_public to use sub-vector similarity for candidate selection, then full-embedding filtering, with an added candidate_multiplier parameter and revised JSON aggregation. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Client
participant DB as Postgres
participant F as search_related_snippets_public
participant SE as snippet_embeddings
participant S as snippets
participant LS as label_summary
Client->>DB: CALL search_related_snippets_public(snippet_id, p_language, match_threshold, match_count, candidate_multiplier)
DB->>F: Execute function
Note over F: Fetch source embedding and compute source_sub_embedding (512)
alt source_embedding is NULL
F-->>Client: Return []
else
F->>SE: Sub-vector similarity search (HNSW on sub_vector(...,512))
Note over SE,F: Select top (match_count * candidate_multiplier) by sub-vector distance
F->>SE: Filter by full embedding similarity > match_threshold<br/>Order by full distance, LIMIT match_count
F->>S: Join snippet metadata
F->>LS: LEFT JOIN label_summary
F-->>Client: jsonb_agg(final_snippets)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (5 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Changes requested ❌
Reviewed everything up to 9aa9509 in 1 minute and 24 seconds. Click for details.
- Reviewed
151lines of code in3files - Skipped
0files when reviewing. - Skipped posting
2draft comments. View those below. - Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. supabase/database/sql/search_related_snippets_public.sql:43
- Draft comment:
Clarify the use of '-(embedding <#> source_embedding) > match_threshold'. Ensure that the <#> operator returns the negative inner product (or equivalent) so that negating it yields a proper similarity score. A brief comment on this assumption would aid future maintainers. - Reason this comment was not posted:
Comment looked like it was already resolved.
2. supabase/database/sql/sub_vector.sql:24
- Draft comment:
Ensure the subvector elements maintain their original order. Modify the unnest query to also select the ordinal index and use an ORDER BY clause in the ARRAY_AGG to guarantee correct ordering. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% 1. PostgreSQL maintains order by default in array_agg without ORDER BY. 2. The unnest WITH ORDINALITY already preserves order in the CTE. 3. There's no JOIN or other operation that would disturb the order. 4. The comment starts with "Ensure that..." which is a red flag. 5. The current code already maintains order correctly. Could there be edge cases where PostgreSQL's default ordering behavior changes? Could future PostgreSQL versions change this behavior? While PostgreSQL behavior could theoretically change, we should trust the current documented behavior and not add unnecessary complexity for hypothetical future changes. The comment should be deleted as it suggests adding unnecessary complexity to solve a non-existent problem, and it starts with "Ensure that..." which violates our commenting rules.
Workflow ID: wflow_YC3kEbZoBolL9qQB
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
| ), | ||
| norm(factor) AS ( | ||
| SELECT | ||
| sqrt(sum(pow(elem, 2))) |
There was a problem hiding this comment.
Add a check for zero-norm vectors to avoid division-by-zero errors when computing normalization.
There was a problem hiding this comment.
Code Review
This pull request enhances snippet searching by implementing a two-stage adaptive retrieval strategy using sub-vectors, which is a solid approach for improving performance. The changes are well-structured, introducing a new sub_vector function, a corresponding HNSW index, and updating the main search function. My review includes a few points to improve the implementation further: I've pointed out a potential division-by-zero error in the new sub_vector function, suggested addressing a hardcoded value for better maintainability, and raised a concern about the removal of a statement timeout which could impact database stability.
| RETURN COALESCE(result, '[]'::jsonb); | ||
| END; | ||
| $$ LANGUAGE plpgsql | ||
| SET statement_timeout TO '30s'; |
There was a problem hiding this comment.
The SET statement_timeout TO '30s' has been removed. Vector searches can be resource-intensive, and removing the timeout could allow for long-running queries that might degrade database performance or stability. It is advisable to retain a statement timeout as a safeguard unless it is being managed at a higher level.
| unnormed | ||
| ) | ||
| SELECT | ||
| array_agg(u.elem / r.factor)::extensions.vector |
There was a problem hiding this comment.
The current implementation does not handle the edge case where the input sub-vector has a norm of 0 (i.e., it's a zero vector). In this situation, r.factor will be 0, causing a division-by-zero error. To make the function more robust, you should handle this case. A simple fix is to use NULLIF to handle the zero norm, and COALESCE to default to a divisor of 1, which will correctly produce a zero vector.
array_agg(u.elem / COALESCE(NULLIF(r.factor, 0), 1))::extensions.vector| SELECT embedding, sub_vector(embedding, 512)::vector(512) | ||
| INTO source_embedding, source_sub_embedding |
There was a problem hiding this comment.
The sub-vector dimension 512 is hardcoded here and also on line 37. This value is also used to define the index in snippet_embeddings_sub_vector_idx.sql. Using a magic number in multiple places makes the code harder to maintain. If you need to change the dimension, you'll have to update it in several locations. Consider declaring a variable for this dimension at the beginning of the function body to centralize it.
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
supabase/database/sql/search_related_snippets_public.sql(3 hunks)supabase/database/sql/snippet_embeddings_sub_vector_idx.sql(1 hunks)supabase/database/sql/sub_vector.sql(1 hunks)
| SELECT jsonb_agg(fs) | ||
| INTO result | ||
| FROM final_snippets fs; |
There was a problem hiding this comment.
Restore deterministic similarity order in the JSON payload.
jsonb_agg(fs) without an ORDER BY loses the similarity ordering enforced inside similar_snippets, so callers can now receive results in arbitrary order. For search this is a functional regression.
Patch suggestion:
@@
- final_snippets AS (
+ final_snippets AS (
SELECT
s.id,
s.title,
s.file_path,
s.recorded_at,
s.comment_count,
s.start_time,
CASE
WHEN p_language = 'spanish' THEN s.summary ->> 'spanish'
ELSE s.summary ->> 'english'
END AS summary,
a.radio_station_name,
a.radio_station_code,
a.location_state,
- COALESCE(ls.labels, '[]'::jsonb) AS labels
+ COALESCE(ls.labels, '[]'::jsonb) AS labels,
+ -(s.embedding <#> source_embedding) AS similarity
FROM similar_snippets s
JOIN audio_files a ON a.id = s.audio_file
LEFT JOIN label_summary ls ON ls.snippet = s.id
)
- SELECT jsonb_agg(fs)
+ SELECT jsonb_agg(to_jsonb(fs) - 'similarity' ORDER BY fs.similarity DESC)
INTO result
FROM final_snippets fs;This keeps the payload identical while guaranteeing the top-N remain sorted by their full-embedding similarity.
🤖 Prompt for AI Agents
In supabase/database/sql/search_related_snippets_public.sql around lines 84 to
86, jsonb_agg(fs) discards the similarity ordering enforced earlier and can
return results in arbitrary order; change the aggregation to preserve order (for
example aggregate from an ordered subquery or use jsonb_agg(fs ORDER BY
similarity DESC)) so the JSON payload keeps the top-N sorted by full-embedding
similarity while leaving the payload shape unchanged.
Important
Enhances
search_related_snippets_publicwith sub-vector filtering and indexing for efficient snippet retrieval.search_related_snippets_publicfunction now uses sub-vectors for initial filtering of snippets, improving search efficiency.candidate_multiplierparameter to control the number of candidates considered.sub_vectorfunction to extract and normalize sub-vectors from embeddings.snippet_embeddings_sub_vector_idxindex for efficient sub-vector retrieval using HNSW.This description was created by
for 9aa9509. You can customize this summary. It will automatically update as commits are pushed.
Summary by CodeRabbit